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INTRODUCTION 

Field of the Invention: 

This invention relates to the area of bioinformatics, more specifically to methods for analyzing the 
sequences of evolutionarily related proteins, and most specifically for identifying evolutionary and 
functional relationships between proteins and the genes that encode them. 

Background 

Proteins are linear polypeptide chains composed of 20 different amino acid building blocks. 
Determining the sequence of amino acids in a protein is now experimentally routine, both by direct 
chemical analysis of the proteins themselves, and by translation of genes that encode proteins. Genome 
sequencing projects provide those genes in abundance, and the full complement of genes is known for 
many microorganisms, yeast, the worm C elegans, and the fly Drosophila melanogaster. The size of 
protein sequence databases will grow explosively over the next decade as more genome sequencing 
projects are completed. 

While genomic sequence data are widely believed to hold the key to a revolution in biology, and while 
their impact on experimental work is already being felt, much of the revolution has still not materialized. 
Missing in particular are bioinformatics tools that extract information about biological function starting 
from genomic data and existing biological information, in a form that can be used by biomedical 
researchers. This makes it difficult to asign a "utility" to a gene sequence, in turn making it difficult to 
capture commercial value from gene sequence data. This issue is found throughout the commerce of 
modern genomics, including in the patenting of genes. It is still not clear what standard must be applied to 
establish that a gene patent application meets the "utility" requirement of the code, but information 
concerning the biological function of the gene for which a patent is sought is almost certainly helpful. 
Tools that generate testable hypotheses about the biological function of genes that encode proteins have 
utility precisely because they enable genome sequences to be exploited commercially. 
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Once those tools are in place, it should be possible to mine genomic sequence data for information 
about pathways [Mar99], generate insights into complex biological function (e.g. development) [RubOO], 
and perhaps even identify pharmaceutical targets [PolOO]. Clearly, the payoff from a tool that could 
provide biological information from sequence data could be enormous. Experiments are expensive; 
genomic data are not (or, more precisely, their cost has already been amortized). 

For this reason, functional bioinformatic tools need not be generate "indisputable conclusions" for 
them to have a broad utility. The payoff would be substantial if the tool did nothing more than suggest a 
hypothesis about the function of a gene, or rule out some fraction of a protein family as relevant for a 
function, both ways of targeting subsequent experiments. Modern biomedical research, especially that 
directed towards the development of therapeutic agents, is largely a random walk in any case. Functional 
bioinformatic tools would be valuable if they do nothing more than favorably bias that walk. 

Generating hypotheses concerning possible biological functions that a gene might have is part of the 
"annotation problem" in modern genomics. "Annotation", and (indeed) "function", mean very different 
things to different people. Much of the literature regards "function" as equivalent to "behavior", what is 
measured in the laboratory. Crystallographers occasionally view "function" as equivalent to "structure". 
Throughout this proposal, we distinguish between "homology" (relationship by common ancestry), 
"structure" (at two levels, as known to those skilled in the chemical arts, "constitution", meaning 
"sequence", and "conformation", which is commonly referred to as "the fold" or, incorrectly, "the 
structure" by structural biologists), "behavior" (what is measured in the laboratory), and "function". Under 
Darwinian theory, "function" refers to adaptive behavior, properties that confer fitness, the ability of an 
organism to survive and reproduce. The Darwinian paradigm holds that the only way to achieve function 
is by random variation of genetic structure (mutation) followed by natural selection. 

It is important to keep these distinctions clear. A statement concerning the function of a gene is 
ultimately a statement about how that gene contributes to the fitness of the host organism. For this reason, 
homologous proteins in different species generally do not have "the same function", as different species 
have different requirements for fitness. They may, however, have "analogous" functions. As we disclose 
below, even subtle differences in function between orthologous proteins in different organisms can be 
very interesting, and can be the key to delivering a true understanding of "function" to biological and 
biomedical research scientists. Therefore, tools that suggest (again, as a hypothesis) that the function of 
two homologous genes might be different are frequently as useful as those that suggest that the two genes 
have analogous functions. 

In any case, the fact that Darwinian processes have generated the genes through natural selection 
acting on the encoded proteins means that an evolutionary analysis must at some point be involved in any 
annotation procedure; an evolutionary analysis is necessary to analyze for function. Further, as we shall 
disclose, an evolutionary analysis is sufficient to generate hypotheses concerning function. As with any 
hypothesis in science, functional hypotheses cannot be "proven". Rather, as with any hypothesis in 
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science, they form part of a web of conjecture, hypothesis, data, and interpretation that builds a useful 
picture of function; utility comes long before "proof, in part because M proof 1 is not possible for any 
substantive statement about the natural world. 

The most common way in which evolutionary analysis is used today to annotate sequences involves 
pairwise sequence comparisons to detect homology between two proteins. The conventional recipe for 
inferring the "function" of a target open reading frame (the target sequence) using evolutionary analysis 
follows five steps: 

1. Use the target sequence as a probe in a BLAST [Alt90] or FASTA [Lip85] search of the Genbank 
database (or an equivalent). 

2. Identify "hits", proteins in the database whose sequence resembles that of the target. 

3. Evaluate the hits based on a statistical model." 

4. Download the annotation of the statistically best "hits" that have functional annotation of their own. 

5. Infer that the function of the target protein is the same as the function of the best protein hit 

Those annotating genomic sequences using this recipe recognize several obvious limitations to the 
pairwise analysis approach to annotation. Commonly encountered problems include: 

(a) The BLAST server returns no hits. 

(b) The BLAST server returns sequences of possible homologs, but with similarity scores too low to be 
certain that the sequence found is indeed a homolog. 

(c) The homologous sequences that the server returns have no annotation indicating function. 

The consequences of these problems are mentioned in most contemporary reviews of genomics. 
Because of the limitations of this recipe, some 40% (depending on the details of the homology search) of 
the proteins in a typical genomic sequence have not been reliably assigned any function. It is well 
recognized that this arises because as powerful as it is, BLAST cannot detect homologs after their 
sequences have diverged far into the "twilight zone" of sequence similarity, defined (arbitrarily) by 
Doolittle as less than 20% sequence identity [Doo87]. For this reason, tools that detect more distant 
homology are being actively sought. 

To solve this "no interesting hit" problem, many workers have attempted to extend the power of the 
pairwise alignment tool to detect increasingly distant protein sequences. Pearson for example recently 
developed statistical parameters from the distribution of similarity scores from thousands of unrelated 
sequences to gain an estimate of the statistical significance that can be used to infer sequence homology 
[Pea98]. BLAST has been improved by transformation into PSI-BLAST [Alt97]. 

Other approaches have been based on the fact that proteins diverging under functional constraint can 
retain their core folded structure long after sequence similarity has vanished [Ros75]. This makes possible 
the detection of distant homology by comparison of predicted structures. In Serial No. 07/857,224, filed 
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March 25, 1992, and issued as US Patent 05958784 on September, 28, 1999, a partent of this application, 
we disclosed the first useful tools to apply this approach. Many others have followed our lead, including a 
particularly interesting study by Barton and his coworkers [Rus96], and we have recently published a 
review article showing various successes of this approach [Ben97]. 

Through their ability to detect distant homologs of proteins whose functions are known, the tools 
disclosed in Serial No. 07/857,224 have proven to be quite useful because they generate hypotheses about 
proteins with unknown function. For example, these tools were applied to the heat shock protein 90 
(HSP90) family, for which no member had an assigned function. A model for the conformation of the 
protein was built for HSP90 as part of the CASP2 prediction contest [Ger97]. The conformation model 
was recognizably similar to the fold for the N-terminal ATP binding domain of gyrase. This generated the 
hypothesis that HSP90 and gyrase were distant homologs. This generated the hypothesis that HSP90 
bound ATP as it contributed to the fitness of its host organisms. This hypothesis contradicted 
experimental papers available at the time claimed that HSP90 did not bind ATP [Jac96]). The prediction 
was correct; the experimental papers were incorrect [Pro97]. And this success was recognized both by the 
CASP2 judges and the group that solved the crystal structure of HSP90, who wrote: 

"The tertiary fold of Hsp90 N-domain has a remarkable and totally unexpected similarity to the N-terminal ATP- 
binding fragment of ... DNA gyrase B protein. This similarity was not initially recognized by the authors of either 
the human or yeast structures but was determined [by Gerloff and Benner] within the CASP2 structure prediction 
competition. Our observation of specific ADP/ATP binding to Hsp90 completely contradicts the careful and widely 
accepted biochemical analysis of Jakob et al. (1996) who demonstrated that Hsp90 could not be photolabelled with 8- 
azidoATP, was not retained on C8 agarose, and did not enhance the fluorescence of MABA-ADP." [Pro97] 

A year later, the tools disclosed in Serial No. 07/857,224 were used to analyze functional and 
structural relationships for ribonucleotide reductase [Tau97]. Other examples in our laboratory and 
elsewhere are summarized in [Ben97]. These examples showed how very distant homolog detection had 
utility in annotating gene sequences, simply by generating experimentally testable hypotheses concerning 
their physiological function. 

The pairwise alignment approach to annotation has become dominant in contemporary genomics. 
Indeed, many patent applications for individual gene sequences are being submitted to the PTO where the 
sole argument for utility is based on the statement that the gene being patented is homologous to gene X, 
that homologous genes have analogous functions, and that this implies that the gene being patented and 
gene X have the same function. The logic here is outlined below: 

(a) Sequence similarity between two proteins implies that the two proteins are homologous (share 
common ancestry). 

(b) Homologous proteins have analogous conformations (folds). 

(c) Analogous folds implies analogous behaviors (what is measured in the laboratory). Thus, 
homologous proteins bind analogous ligands, catalyze analogous reactions, and have analogous physical 
properties. 
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(d) Homologous proteins have analogous function, that is, they contribute to survival in analogous 
ways. 

These assumptions constitute the "homology implies analogous function" (fflAF) logic. It is widely 
regard as the foundation for annotation (see, for example, Skolnik's contribution in this area [Fet98]). 

Element (a) has sound statistical basis, at least within the context of a particular evolutionary theory. 
Element (b) is known from empirical analysis to be generally true, provided that the two proteins have 
diverged under functional constraints. Elements (c) and (d) are, however, are certainly not universally true, 
and may not be true in general. Frequently, new function is generated in biological systems by 
recruitment of a protein that performs a different function. 

Already in 1988, well before the age of the genome, it was well known that the logic otlined above was 
not reliable, because of frequent recruitment in the biological world. The Applicant himself reviewed 
examples where conventional logic would provide deceptive annotation [Ben88]. Four examples illustrate 
how severely elements (c) and (d) of the logic can be violated. 

The first is chosen from eubacterial enzymology, and relates to three enzymes playing three distinct 
roles in microbial metabolism, fumarase (in the citric acid cycle), aspartase (involved in amino acid 
degradation), and adenylosuccinate lyase (essential for nucleic acid biosynthesis). The three proteins are 
clearly recognizable as homologs. Their sequences share statistically significant similarity, as illlustrated 
for the following three excerpts: 

LPENEPGSSMPGKVNFI^ fumarase 

LPELQAGSSIMPAKVNPVVP aspartase 

FEKDQIGSSAMPYEKNPMRS adenylosuccinate lyase 

The overall folded form of the three proteins is the same (an 8-fold alpha-beta barrel). They catalyze 
reactions that, at least at a mechanistic level, have some degree of analogy. From a biologist's perspective, 
however, they have very different functions. The HIAF logic used by virtually every genome annotation 
tool would be deceived by this family. 

The second example is from metazoan biology, and involves the family of proteins known as the src 
homology 2 (SH2) domains. SH2 domains are clearly all homologous. The proteins all have analogous 
folds and analogous behaviors; they all bind to a polypeptide that carries a phosphotyrosine. But they 
bind different peptide sequences flanking the phosphotyrosine. For this reason, they have very different 
functions, as the biologist defines it (and as it is defined throughout this proposal, the behavior that 
confers survival value). Some SH2 domains are in viruses, and regulate viral growth. Some participate in 
the immune response. Others are involved in the regulation of division of non-immune cells. For virtually 
any practical purpose (pharmaceutical target identification, for example), the analogies between the 
behaviors of different SH2 domains are less important than the differences in their function. 
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The third example used covarion behavior to detect changing function in elongation factors, proteins 
whose sequences are so highly conserved that there is little difficulty in recognizing homologs, even in the 
three kingdoms of life. If any protein "has the same function" in different species in orthologous form, it 
is elongation factors, we reasoned. Even so, covarion behavior clearly showed that different (presumably) 
orthologous forms of the protein had subtly different behaviors that indicate subtle differences in 
function. Coupling the evolutionary insights based on a sophisticated, mathematically detailed, 
evolutionary analysis to structural biology even identified residues that were important for these 
differences. 

The fourth compares protein serine kinases and protein tyrosine kinases. These families are clearly 
homologous, the latter having been recruited from the former ca. 600 million years ago. The chemist 
would say that both classes of enzyme operate via analogous reaction mechanisms, differing only in the 
source of the oxygen nucleophile in the phosphoryl transfer reaction. The biologist would note, however, 
that the physiological function of the two classes of proteins are greatly different. For any biomedical 
application, the biologist would be correct. The physiologically relevant differences in behavior, central to 
the understanding of biological function (phosphorylation on tyrosine versus phosphorylation on serine) 
cannot be inferred for one family from the other using the conventional logic. 

If proteins with recognizable (indeed, often high) sequence similarity can have different functions, the 
a fortiori argument tcan be made that surely recruitment is possible for protein homologs with marginally 
significant sequence similarities. This argument suggest that the focus of efforts, including those 
disclosed in Serial No. 07/857,224, might add only some to our ability to provide reliable annotation. 

In reality, both approaches are needed. This application focuses on tools to detect and/or rule out 
recruitment within a protein family. We expect these tools to become increasingly more important as the 
genomes of metazoans are sequenced. It is now clear that the last 500 million years of molecular evolution 
in higher organisms has involved repeated recruitment of existing folds to perform new functions. 

SUMMARY OF THE INVENTION 
As discussed in Serial No. 08/914,375, the parent for the instant application, the physiological 
function of a biomolecule is ultimately determined by the contribution that the biomolecule makes to the 
efforts of the host organism to survive, select a mate (in higher organisms), and reproduce. Determining 
the physiological function of a protein is not trivial, however. Difficulties in establishing physiological 
function are discussed at length by Benner and Ellington [Benner, S. A., Ellington, A. D. Interpreting the 
behavior of enzymes. Purpose or pedigree? CRC Crit Rev. Biochem. 23, 369-426 (1988)]. Still more 
difficult is identifying which behaviors of a protein as measured in vitro are relevant for physiological 
function in vivo. Nevertheless, the identification is important. In vitro behaviors that have relevance to 
physiological function in vivo are those that are interesting to study for biotechnological, biomedical, or 
other applications. There is at present in the art no general method for determining what in vitro behaviors 
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are relevant to in vivo function. Processes for determining these behaviors were claimed in the parent 
application (Serial No. 08/914,375). 

A method for making a model for the folded structure of a set of proteins from an evolutionary 
analysis of a set of aligned homologous protein sequences was claimed in Serial No. 07/857,224. The 
instant application concerns methods for using these models. The first method is used to confirm or deny 
a hypothesis that two proteins are homologous, and is comprised of comparing a predicted structure 
model for one family of proteins with a predicted structure model for a second family of proteins, or an 
experimental structure for the second family, and deducing the presence or absence of homology based on 
the presence or absence of structural similarity flanking key residues in the polypeptide sequence. The 
second method identifies mutations during the divergent evolution of a protein sequence that are 
potentially adaptive by identifying episodes during the divergent evolution of a family of proteins where 
there is a high absolute rate of amino acid substitution, or a high ratio of non-silent substitutions to non- 
silent substitutions. Amino acids that are changing during this episode are likely to be adaptive. The third 
is a method for identifying specific in vitro properties of the protein that are likely to play a physiological 
role in vivo in an organism. This methods involves synthesizing in the laboratory proteins having the 
reconstructed amino acid sequences of a protein before and after a period of rapid sequence evolution that 
characterizes adaptive substitution, measuring the in vitro properties of the protein before the episode of 
rapid sequence evolution, and then measuring the in vivo properties of the protein after the episode of 
rapid sequence evolution. The in vitro behaviors that remained unchanged through this episode are not 
likely to have adaptive significance physiologically. The in vitro behaviors that changed through this 
episode are likely to have adaptive significance physiologically. The fourth concerns method for 
organizing genome sized sequence databases. 



BRIEF DESCRIPTION OF THE DRAWINGS 

Drawing 1. The three elements modeling the evolutionary history of the leptins, proteins from the 
"obesity gene" identified by genetics experiments in mice. Homologs are found in other mammals 
(including human), (a) An evolutionary tree showing the pedigree of each leptin family member, (b) A 
part of the multiple alignment, showing the genetic relationship of amino acids in the protein sequence. 
The reconstructed ancestral sequence from the (now extinct) ancestor of humans, rodents, and ruminants 
(marked "X") is shown in the alignment. The sequence as shown here is deterministic; in the work to be 
performed here, the ancestral sequences are all probabilistic (see text) 

080 090 100 110 120 

I • I - I I ■ I 

RNVIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS human 

RNMIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS chimp 

RNMIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS gorilla 

RWIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDRLGGVLEASGYS orangutan 

RWIQISNDLENLRDLLHLLAFSKSCHLPLASGLETLESLGDVLEASLYS rhesus 

QNVLQIAHDLENLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYS rat 



7 



Steven A. Benner 



QNVLQIAHDLENLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYS 
QNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLYS 



mouse 



rat 



RNVIQISNDLENLRDLLHLLASSKSCPLPQARGLETLESLGGVLEASLYS 



ancestor X 



RNVIQISNDLENLRDLLHLLASSKSCPLPQARALETLESLGGVLEASLYS 
RNVIQISNDLENLRDLLHLLAASKSCPLPQVRALESLESLGWLEASLYS 
RNWQISNDLENLRDLLHLLAASKSCPLPQVRALESLESLGWLEASLYS 



pig 
sheep 



ox 



RNWQISNDLENLRDLLHLLASSKSCPLPRARGLETFESLGGVLEASLYS 



dog 



Drawing 1. xxx Evolutionary tree showing the evolutionary history of the leptins. Heavy lines show 
branches with expressed/silent ratios higher than 2. Hatched lines show branches with expressed/silent 
ratios from 1 to 2. Dotted lines show branches with expressed/silent ratios less than 1, or indeterminate. 
Numbers on the lines indicate the ratio of expressed/silent changes for that branch. An "x" at the end of a 
branch signifies that a sequence for the protein is available in the database. Accoprding to the method of 
the instant invention, a correlation between the episode of high sequence evolution and the 

Drawing 2, Evolutionary tree showing the evolutionary history of the leptin receptors. Heavy lines 
show branches with expressed/silent ratios higher than 2. Hatched lines show branches with 
expressed/silent ratios from 1 to 2. Dotted lines show branches with expressed/silent ratios less than 1, or 
indeterminate. Numbers on the lines indicate the ratio of expressed/silent changes for that branch. An "x M 
at the end of a branch signifies that a sequence for the protein is available in the database. 



Drawing 3. An example of homoplasy taken from the evolution of alcohol dehydrogenase from yeast 
(position 30). At at least three points in the tree, a P->A substitution occurred independently. 



Drawing 4. For 17 vertebrate aromatases, an unrooted evolutionary tree built by a Darwin-based 
(Gonnet & Benner, 1991) based on an analysis of amino acid sequences. Numbers on the branches are 
the K a /K s ratios evaluated using the methods of Fitch (1971) to reconstruct intermediate evolutionary 
states and Li et al. (1985). The key is given below, togetherwith the multiple sequence alignment used to 
calculate the tree. 



1. Tilapia nilotica (rainbow trout), GenBank gl613859, mRNA (Chang et al, 1997) 

2. Oryzias latipes (medaka), GenBank gl786171, ovarian follicle mRNA (Tanaka et al., 1995) 

3 . Danio rerio (zebrafish), GenBank g2306966 aromatase mRNA 

4. Carassius auratus (goldfish) ovary, GenBank g2662330, ovarian mRNA 

5. Ictalurus punctatus (channel catfish), GenBank g9 12802 (Trant, 1994) 

6. Carassius auratus (goldfish) brain, GenBank g2662328, brain mRNA 

7. Sus scrofa (pig) placental, isoform 2, GenBank gl762232, mRNA (Choi et al., 1997a) 

8. Sus scrofa (pig) embryo, isoform 3, GenBank gl244543, mRNA (Choi et al., 1996) 

9. Sus scrofa (pig) ovary, isoform 1, GenBank gl928957, mRNA (Conley et al., 1997) 

10. Bos taurus (ox), GenBank g665546, mRNA (Hinshelwood et al, 1993) 

1 1 . Equus caballus (horse), GenBank g2921277, mRNA (Boerboom et al. 1997) 

12. Mus musculus (mouse), GenBank g3046857, mRNA (Terashima et al. 1991) 

13. Rattus norvegicus (rat), GenBank g203804, mRNA (Hickey et al, 1990) 
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14. Oryctolagus cuniculus (rabbit), GenBank gl240042, mRNA (Delarue et al, 1996) 

15. Homo sapiens (human), GenBank g28846, mRNA (Harada, 1988) 

16. Gallus gallus (chicken), GenBank g21 1703 (McPhaul et al., 1988) 

17. Poephila guttata (zebra finch), GenBank g926845, ovary mRNA (Shen et al, 1994) 

010 020 030 040 050 060 070 080 

I I I I I I I I 

1 MVLEMLNPMHYKVTSMVS EWPFAS I AVLLJLTGFIjLLWnsr^KNTS - S I PGPGYFLGIGPL I SYLRFLWMGIGSACNYYNK 

2 MFLEMLNPMQYNVT IMVPETVTV SAMPLLL IMGLLLL IWNCE S S S - S I PGPGYCLGIGPL I SHGRFLWMGIGSACNYYNK 

3 MILEMLNPMHYlJLTSWPEvMPVATL P I LLLTGFLFFVWNHEETS - S I PGPGYCMGIGPL I SHLRFLVMGLGSACNYYNK 

4 VTiELLMQGAHNSSYGAQDWCGAMATLLLLLLCLLIiAIM 

5 -MEEVLKGTVKFAATOQVTLiMALTGTLLL I LLJIRIFTAKISIWRNQS -GVPGPG^LGIKSP IMSYSRFLWMGIGSACNYYNE 

6 WDLLIQRAHNGTERAQDI^CGATATILLLLL^ 

7 MVLEMLNPMYYKITSWSEWPFASIAVLLLTC^ 

8 LVSIAPtTIWGLP-SGIPMATRSLILLVCLLIiMVWSHSEKK-TI 

9 MVLEMLNPMN — ISSWSEAVLFGSIAILLLIGLLLWV^ 

10 MVLEMLJ^PMHFN 

11 VME ILLREARNGTDPRYENPRG- ITLLLLLCLVLLLTVW3SIRHEKKC S I PGP SFCLGLGPLMSYCRF IWMGIGTASNYYNE 

12 MVLETMPLHYNITSLVPDTMPVAT^ 

13 WAE^LCDLKCHPIIXjISMATRTLILLVCLLLVAWSHTDKK- IVPGPSFCLGLGPLLSYLRF IWTGIGTASNYYNN 

14 MLLEVLNPRHYIWTSWSEVVPIASIA^ 

15 MVIjEMLNP IHYNIT S ri/PEA^AATMFVLLLTGLFLLVWNYEGT S - S I PGPGYCMG IGPL I SHGRFLWMGIGSACNYYNR 
16 MPVATVPI I ILICFLFLIWNHEETS- S I PGPGYCMG IGPL I SHGRFLWMGV^ 

17 MFLEMLNPMHYNVTIMVPETV^ I PGPGYCLGIGPL I SHGRFLWGIGSACNYYNK 

090 100 110 120 130 140 150 160 

I I I I I I ! I 

1 TYGEFIRWIGGEETLIISKSSSVFHVMKHSHYTSRFGSKPGLQF 

2 MYGEFMRVWISGEETLI I SKSSSMFHVMKHSHYISRFGSKRGLQCIGMHEN^ 

3 MYGEFVRVWI SGEETLVI SKSSSTFHIMKHDHYSSRFGSTFGLQYMGMHENGVIFN1M 

4 KYGDIVRWINGEETLI L SRSSAVYHVIjRKSLYTSRFGSKLGLQC IGMHEQGI IFNSWALWKKVRTFYAKALTGPGLQR 

5 KYGSIARWISGEETFILSKSSAVYHVLKSNNYTORFASKKGLQC IFNSNMALWKKVRTYFTKALTGPGLQK 

6 KYGDIWWINGEETLILSRSSAVYHVLRKSLYTSRFGSKLGLQCIGMHEQGI IFNSWALWECKVRAFYAKALTGPGLQR 

7 MYGEHMRVWIGGEETLI I SKSSSWHVMKHSHYTSRFGSKPGLECIGMYEKGI IFNNDPALWKAWTYFMKALSGPGLVR 

8 KYGDWRWINGEETLILSRASAVHHVLKNRIOT IFNMJVALWKKIRTYFTKALTGPNLQQ 

9 MYGEFMRVWIGGEETLI I SKSSSIFHIMKHNHYTCRFGSKLGLECIGMHEKGI 

1 0 MYGEFIRWICGEETLI I SKSSSMFHVMKHSHYVSRFGSKPGLQCIGMHENGI IFNNNPALWKVWPFFMKALTGPGLVQ 

11 KYGDMVRWISGEETLVLSRPSAVYHVLKHSQYTSRFGSKLGLQC^ 

12 TYGDFVRVWI SGEETFI I SKSSSVSHVMOiWHWSRFGSKLGLQCIGMYENGI IFNNNPAHWKEIRPFFTKALSGPGLVR 

13 KYGDITOWINGEETLILSRSSAVHHVLKNGNYTSR^ 

14 MYGEFMRVWCGEETLIISKSSSMFHVMKHSHYISRFGSKLGLQFI^^ 

15 WGEFMRWISGEETLIISKSSSMFHIMKHNHYSSRFGSK^ 

16 TYGEFVRVWISGEETFI I SKSSSWHVMKHWNYVSRFGSKLGLQC IGMYENGI IFN1SINPAHWKEIRPFFTKALSGPGLVR 

17 MYGEFMRVWISGEETLI I SKSSSMVHVMKHSNYISRFGSKRGLQCIGMHENG 

170 180 190 200 210 220 230 240 

I I I I I I I I 

1 MVWCADSITKHLDKLEEVRI^^ 

2 MVEVCVES IKQHUDRLGEWDTSGYVDVLTLMRHIMLDTSNMLFLG I PLDE SAIVKKIQGYFNAWQALL IKPNIFFKI S - 

3 MVWCT^SVNNHLDRI^ PLDEKNIVLKIQGYFDAWQALL IKPNIFFK I s - 

4 TLEICITSTNTHLDNLSHLMDARGQTO 

5 SVDVCV SATNKQLNVLQEFTDHSGHVDVIJSJLLRC IWDVSNRLFLRI PLNEKDLL IKIHRYFSTWQAVL IQPDVFFRLN- 

6 IMEICTTSTNSHLDDLSQLTDAQGQLDILNLLRCIVVDVSNR^ 

7 MVWGADSITKHLDKLEEVRND^ - 

8 TVEVCVTSTQTHLDNLSSL SYVDVLGFLRCTVVDI SNRLFLGVPVDEKELLQKIHKY^ - 

9 MVWCADSITKHLDKLEEVRNDLGYTOVLTLt^ 

10 MVAICVGSIGRHLDKLEEVTTRSGCTO 

11 TLEICTMSTOTHLDGLSRLTDAQGHVDVimLRCIVVDIS 

12 MIAICVESTTEHLDRLQEVTTELGNINAL 

13 TVDVCV SSI QAHLDHLDSL GHVDVLJ^LRCTVLDI SNRIF^ IK PD I YFKFR- 

14 MVT ICADS I TPCHLDRLEEVCNDLGYVDVLTLMRRIMLD^ PLDESAIWNIQGYFDAWQALLLKPD I FFKI S - 

1 5 MVWCAESLKTHLDRLEEVTINESGYV]^ 

1 6 MIAIC^STIVHLDKLEEOTTEVGNVN^ 

17 MVWC^SIKQHLDRLGDVTDNSGYVDVVTLMRHI^ 
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250 260 270 280 290 300 310 320 

I I I I ! I I I 

1 WLYRKYEKSVKDLKEDME IL I EKKRRRI FTAEKLEDCMDFATEL ILAEKRGELTKENVNQC ILEML I AAPDTMSVTVFFM 

2 WLYRICYERSVKDLKDE I AVL\^KKRHKVSTAEKLEDC14DFATDL IFAERRGDLTKENVNQC ILEML I AAPDTMSVTLYFM 

3 WLSRKHQKS IKELRDAVG ILlAEEKRHF.IFTAEKLEDH^/DFATDL ILAEKRGELTKENVNQC ILEMMI AAPDTLSVTVFFM 

4 WLHGKHKTOAQELQDAIAALIEQKRVQLTRAEKFDQ-LDFTGELIFAQS IAAPDTLS I SLFFM 

5 FVYKKYHLAAKELQDEMGKLVEQKRQAIN^ 

6 WLHRKHKRDAQELQDAITALIEQK^ 

7 WLYKKHKESVKDLKEN^ 

8 WIHQRHKTAAQELQDAIESLVEKKRKEMEQAEKLDN- INFTAEL IFAQGHGEL SAENVRQCVLEMVI AAPDTLS I SLFFM 

9 WLYRKYEKSVKDLKDAME IL IEEKRHRI STAEKLEDSMDFTTQL IFAE2CRGELTKENTO I TVFFM 

10 WLYKKYEKSVKDLKDAIDILVEKXRRRISTAEKLEDHMDE 

11 WLHDKHRMAAQELHDAIEDLIEQKRTELQQAEKLI^ 

12 V^CKKYKDAVKDLKGAME IL IEQKRQKL STVEECLDEHMDFASQL IFAQNRGDLTAENVNQCVLEMMI AAPDTL SVTLFFM 

13 WIHHRHKTATQELQDAIi<ELVDQKRK]NIMEQADKLDN- INFTAEL IFAQ3SIHGELSAE3S1VTQCVLEMVI AAPDTLSLSLFFM 

14 TATLCRKYEKSVKDLKDAMEIL IAEKRHRI STAEKLEDS IDFATEL IFAEKRGELTRENVNQC ILEMLIAAPDTMSVSVFFM 

15 WLYKICiTEKSVKDLKDAI EVL IAEKRRRI STEEKLEECMDFATEL I LAEKRGDLTRENVNQC ILEML I AAPDTMSVSLFFM 

1 6 WLCKKYEEAAKDLKGAME IL IEQKRQKL STVEKLDEHMDFASQL IFAQNRGDLTAEIWNQC^ SVTLFIM 

17 WLYRKYERSVKDLKDE IE I LVEKKRQKVS SAEKLEDCMDFATDL IFAERRGDLTKENVHQC ILEML I AAPDTMSVTLYVM 

330 340 350 360 370 380 390 400 

I I I I I I I I 

1 LFL I AKHPQVEEELMKE I QTWGERD I R£HX)MQKLEWENF I YE SMRYQ PWDLVMRKALEDDVIDGYPVKKGTNI I LN I 

2 LLLVAEYPEVEAAI LKE IHTWGDRD IKI EDIQNLKWENF INE SMRYQPWDLVMRRALEDDVIDGYPVKKGTNI I LNI 

3 LCLIAQHPKVEEALMKE IQTVLGERDLKSfDDMQKLKVMENF INESMRYQPVVP rVMRKALEDDVIIX^YPVKKGTNI ILNI 

4 LLLLKQNPDVELKILQEM^IAVLAGRSLQHSHLSGLHILESF IISESIjRFHPVVDFTMRRALDDDVIEGYEVKKGTNI ILNV 

5 LLLLKQNSWEEQIVQE I QS Q IGERDVESADLQKLNVLERF IKE SLRFHPWDF IMRRALEDDE IDGYRVAKGTNL I LNI 

6 LLLLKQNPDVELKILQEMDSVLAGQSLQHSHLSKLQILESF IISESLRFHPWDFTMRRALDDDVIEGYNVKKGTNI ILNV 

7 LFL IAKHPQVEEAI VKE I QWIGERD IRNDDMQKLKVVENF I YE SMRYQPWDLVMRECALEDDVI DGYPVKKGTNI ILNI 

8 LLLLKQNPHVELQLLQEIDT WGDSQLQNQDLQKLQVLESFINECLRFHPVVDFTMRRALFDDI IDGHRVQKGTNI ILNT 

9 LFLIANHPQVEEELMKEIYTWGERDIRNDDMQKLKVVENFIY^ 

10 LFLIAKHPSVEEAIMEEIQTWGERDIRIDDIQKLE^/VENF I YESMRYQPVVDLVMEIKALEDDVIDGYPVKKGTNI ILNI 

11 LLLLKQNAEVERRILTE IHTVLGDTELQHSHLSQLHVLECF I NEALRFHPWDF SYRRALDDDVIEGFRVPRGTNI I LNV 

12 LILIAEHPTVEEEMMREIETWGDRDIQSDDMPNLKIVENFIYESMRYQ 

13 LLLLKQNPHVEPQLLQE IDAWGERQLQNQDLHKLQVMESFI YECLSFHPWDFTMRRALSDDI IEGYRI SKGTNI ILNT 

14 LFL IAKHPQVEEAI I RE IQTWGERD I RI DDMQKLKWENF INE SMRYQPVVDLVMKE^A^ 

15 LFL IAKHPNVEEAI I KE I QTVTGERD IK I DDIQKLKVMENF I YE SMRYQPVVDLVMRKALEDDVIIXjYPVKKGTNI ILNI 

16 LILIADDPTVEEKMMREIETVMGDREVQSDDMPNLKIVENFIYESMRYQP 

17 LLLIAEYPEVETAILKE IHTVVGDRDIRIGDVQNLKVVENF INESLRYQPVVDLVMRRALEDDVIIX5YPVKKGTNI ILNI 

410 420 430 440 450 460 470 480 

I I I I I I I I 

1 GRMHRLEFF PKPNEFTLENFAKWPYR- YFQPFGFGPRACAGK^ LLRRFQVQTPQDRCVEKMQKKNDL 

2 GRMHRLEYF PKPNEFTLENFEKNVPYR- YFQPFGFGPRGCAGKY I AMVMMKWLVTLLRRFQVKTLQKRC IENI PKKNDL 

3 GRMHKLEFF PKPNEFTLENFEKNVPYR- YFQPFGFGPRSCAGKF I AMVMMKVMLVSLLRRFHVKTLQGNCLENMQKTNDL 

4 GRMHRSEFFPKPNEFSLDNFQKNVPSR-FFQPFGSGPRSCVGKHIAMVM^ 

5 GRMHKSEFFQKPNEFNLENFENTVTSR-Y^ 

6 GRMHRSEFFSKPNQFSLDNFHKNVPSR-FFQPFGSGPR 

7 GRMHRLEFFPKPNEFTLENFAKNVPYR-YFQPFGFGPRACAGKYIAM^^ 

8 GRMHRTEFFHKANEFSLENFQKNTPRR-YFQPFGSGPRACVGRHIAMW 

9 GRMHRLEFFPKPNEFTLENFAKNVPYR-YFQPFGFGPRACAGKYIAMVMM 

1 0 GRMHRLEFF PKPNEFTLENFAKNVPYR- YFQ PFGFGPRGCAGKY IAMVMMKVILOTLLRRFQVKALQGRSVENIQK^ 

1 1 GPMHRSEFYPKPADFSLDNFNKPWSR-FFQPFGSGPRSCVGKHIAMVMMKA^ 

12 GRMHKLEFFPKPNEFSLENFEKNVPSR-YFQPFGFGPRSCVGCT 

13 GRMHRTEFFLKGNQFNLEHFENNVPRPPTFQPFG 

1 4 GRMHRLEFFPKPNEFTLENFAKNVPYR-YTQPFGFGPRGCAGKYIAMV^ 

15 GRMHRLEFFPKPNEFTLENFAKNVPYR-YFQPFGFGPRGCAGKYIA 

1 6 GRMHKLEFFPKPNEFSLENFEKNVPSR-YFQPFGFGPRGCVGKFIAMVMMKAI 

17 GRMHRLEYFPKPNEFTLENFEKNVPYR- YFQPFGFGPRSCAGKF 

490 
I 

1 SLHPDETSG 
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2 SLHPNEDRH 

3 ALHPDESRS 

4 SQQPVEEPS 

5 SMQPVEEDP 

6 SQQPVEEPS 

7 SLHPDETSG 

8 SQQPVEHHQ 

9 SLHPHETSG 

10 SLHPDETSD 

11 SQQPVEDKH 

12 SMHPIERQP 

13 SQQPVEHQQ 

14 SLHPDETRD 

15 SLHPDETKN 

16 SMHPIERQP 

17 SLHLDEDSP 



Drawing 5. An evolutionary tree built from neutral evolutionary distances (NEDs) calculated by 
assuming a first order approach to equilibrium for codon usage at two fold redundant silent sites. 
Numbers on branches of the tree correspond to evolutionary time (in million years) estimated from the 
NEDs using a first order rate constant for pyrimidine-pyrimidine transitions of 3 x lO 9 changes per base 
per year. 



Drawing 6. The Notch family, with values for each of the internal nodes, and K a /K s values for each of 
the branches. 



DETAILED DESCRIPTION OF THE INVENTION 

This disclosure describes the classes of tools that permit the scientist to generate experimentally 
testable hypotheses concerning the function of a protein starting from an evolutionary analysis. These are 
outlined below: 

L Tools that detect change in function within a family of proteins. 

A. Ratios of silent to non-silent substitution along specific branches of an evolutionary tree 
including tools that address normalization issues. 

B. Covarion behavior, in which individual residues display different mutability in different 
branches of a tree. 

C. Detecting high absolute rates of amino acid substitution, changes per unit time, 
n. Tools that detect conservation of function within a family of proteins. 

A. Compensatory changes 

B. Homoplasy 

C. Absolute conservation within a defined evolutionary distance 

HI. Tools that identify individual residues involved in changes in functionally significant behavior. 

A. Residues changing in episodes with high Ka/Ks values, minus residues changing in episodes 
with low Ka/Ks values 
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B. Residues displaying covarion behavior 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

IV. Tools that identify individual residues involved in conserved of functionally significant behavior 

A. Residues suffering compensatory changes 

B. Residues displaying homoplasy 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

V. Tools that involve correlation between the evolutionary histories of two families of proteins 

A. Correlating the topology of evolutionary trees in two families of proteins 

B. Correlating the connectivity of proteins in a gene family 

C. Dating events in the molecular history 

D. Correlating evolutionary events in two protein families occuring at approximately the same 
time 

E. Correlating evolutionary events in two protein families that are associated with analogous 
behavior involving expressed/silent ratios 

VI. Tools that involve correlation between the evolutionary history of a family of proteins and the 
evolutionary history of the organism as known from some source other than genomic sequence data, 
including paleontology, geology, ecology, ontogeny, phylogeny, or systematics (collectively known as 
the "non-genomic record". 

A. Correlating the topology of an evolutionary trees and the non-genomic record. 

B. Correlating features of patterns of evolution in specific branches in the evolutionary tree with 
the non-genomic record 

C Correlating evolutionary events in several protein families occuring at approximately the same 
time with the non-genomic record 

Many of these tools are new in this disclosure. Others were disclosed in Serial No. 07/857,224 and 
Serial No. 08/914,375 and are claimed here for the first time. In many cases, elements of novelty and 
utility can be found by combining these tools. This disclosure will systematically indicate the Applicant's 
presently preferred combinations, with statements of where the Applicant believes that the state of the prior 
art requires reference to the priority dates of parent applications, where it does not. 

All of the tools have in common the same starting point, a basic evolutionary model based on three 
parts: 

(a) An evolutionary tree that shows the familial relationship between the members of the protein family, 

(b) A multiple alignment of the sequences of members of the protein family, which shows the 
evolutionary relationship between the individual amino acids in the sequences, and 

(c) The sequences of ancient proteins that were the ancestors of the contemporary proteins in the family. 
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Each element of an evolutionary model requires the other two in the reconstruction process. 
Accordingly, processes for constructing an evolutionary model for a protein family are frequently iterative. 
These processes are well know in the art, and include parsimony tools [Fit67], maximum likelihood tools 
[Gon91][Gon96][Tho92], tools for evaluating the probability of an evolutionary model [Gon96], and 
gamma models [Swo96] [Li97]. 

Serial No. 08/914,375 disclosed the step-by-step procedure in which the basic evolutionary model 
for a family of proteins is constructed to support the tools outlined above. 

(a) A multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree are 
constructed by methods well known in the art for a set of homologous proteins. These three elements of 
the description are interlocking, as is well known in the art. The presently preferred methods of 
constructing ancestral sequences for a given tree is the maximum parsimony methods, as implemented 
(for example) in the commercially available program MacClade [W. P. Maddison, D. R. Maddison, 
MacClade. Analysis of Phytogeny and Character Evolution. Sinauer Associates, Sunderland MA (1992)]. 
Alternative methods for reconstructing evolutionary intermediates can now be found with the PAUP 
program [Swofford, D. L., Olsen, G. J., Waddell, P. J., & Hillis, D. M. (1996) Phylogenetic Inference in 
Molecular Systematics (eds. Hillis, D. M., Moritz, C. & Mable, B. K.) 407-514 (Sinauer Assc., Inc., 
Sunderland, MA, 1996)] and using the maximum likelihood method of the PAML program [Yang, Z. H. 
PAML: a program package for phylogenetic analysis by maximum likelihood. Comput AppL Biosci 13, 
555-556 (1997)]. Trees are compared based on their scores using either maximum parsimony or 
maximum likelihood criteria, and selected based on considerations of score and correspondence to known 
facts. Step (a) is part of the process used to generate the predictions of secondary structure using the 
method disclosed in Serial No. 07/857,224. 

(b) A corresponding multiple alignment is constructed by methods well known in the art for the 
DN A sequences that encode the proteins in the protein family. The multiple alignment is constructed in 
parallel with the protein alignment. In regions of gaps or ambiguities, the amino acid sequence alignment 
can be adjusted to give the alignment with the most parsimonious DNA tree. The presently preferred 
method of constructing ancestral DNA sequences for a given tree is the maximum parsimony method. 
The DNA and protein trees and multiple alignments must be congruent, meaning that when amino acids 
are aligned in the protein alignment, the corresponding codons are aligned in the DNA alignment. 
Likewise, the connectivity of the two evolutionary trees must show the same evolutionary relationships. In 
regions where the connectivity of the amino acid tree is not uniquely defined by the amino acid sequences, 
the tree that gives the most parsimonious DNA tree is used to decide between two trees or reconstructions 
of equal value. Finally, the ancestral amino acids reconstructed at nodes in the tree must correspond to the 
reconstructed codons at those nodes. When the ancestral sequences are ambiguous, and where the DNA 
sequences cannot resolve the ambiguity, the reconstructed DNA sequences must be ambiguous in parallel. 
Approximate reconstructions are valuable even when exact reconstructions are not possible from available 
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data, and the tree is preferably constrained to correspond to evolutionary relationships between proteins 
inferred from biological data (e.g., cladistics). 

(c) Mutations in the DNA sequences are then assigned to each branch of the DNA evolutionary tree. 
These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the tree. When 
ambiguities are encountered, alternatives are weighted equally. Mutations along each branch are then 
assigned as being "silent", meaning that they do not have an impact on the encoded protein sequence, and 
"expressed", meaning that they do have an impact on the encoded protein sequence. Fractional 
assignments are made in the case of ambiguities in the reconstructed sequences at nodes in a tree. 

As disclosed in Serial No 08/914,375, the quality of a multiple alignment and the precision of the 
reconstructed ancestral sequences decreases if proteins are included in the family with sequences 
diverging by over 150 PAM units, where a PAM unit is the number of point accepted mutations per 100 
amino acids. For this reason, families are most preferably constructed with a tree "width" (the distance 
between the two most divergent proteins in the family) of 150 PAM units or less. Some variation is, of 
course, desired. Therefore, the PAM width of the tree is preferably more than 50 PAM units. Also 
referred are well articulated trees. In principle, the more sequences in the tree, the more valuable an 
evolutionary analysis of the tree becomes. 

With the emergence of massive amounts of sequence information as a result of genome projects, the 
ability to construct detailed evolutionary histories of protein families will increase. This will make the 
inventions disclosed herein of still greater value, as is appreciated by one of ordinary skill in the art. 

One key inventive feature of Serial No 07/857,224 was that an evolutionary analysis had additional 
value when placed within well defined. One key inventive feature of Serial No 08/914,375 was that an 
evolutionary analysis gained additional value when it involved analysis of explicitly reconstructed 
intermediates in the evolutionary tree. These inventive concepts are at the core of all of the tools outlined 
above. 

Another key inventive feature of Serial No 08/914,375 was that an evolutionary analysis gained 
additional value when it is correlated with the non-genomic record. This inventive concepts is at the core 
of all of the tools in class VI outlined above. 

Another key inventive feature of Serial No 08/914,375 involved the use of a natural organization to 
generate a rapidly searchable database. As disclosed in the specification to Serial No 08/914,375, when all 
of the genomes of all of the organisms on planet Earth are completed, all protein sequences will be easily 
recognizable as members of one of ca. 10,000-100,000 nuclear families, protein sequence modules 50- 
500 amino acids long that are related by common ancestry. This conclusion reflects the well known fact 
that all organisms on the planet are descendants of a single ancestor. In the course of producing the 
diversity of organisms now on Earth, divergent evolution also produced the diversity of molecular genetic 
sequences within nuclear families. 
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As disclosed in the specification to Serial No 08/914,375, this permits a naturally organized database. 
The ancestral sequences and the predicted secondary structures associated with the families are surrogates 
for the sequences and structures of the individual proteins that are members of the family. The 
reconstructed ancestral sequence represents in a single sequence all of the sequences of the descendent 
proteins. The predicted secondary structure associated with the ancestral sequence represents in a single 
structural model all of the core secondary structural elements of the descendent proteins. Thus, the 
ancestral sequences can replace the descendent sequences, and the corresponding core secondary 
structural models can replace the secondary structures of the descendent proteins. 

This makes it possible to define two surrogate databases, one for the sequences, the other for 
secondary structures. The first surrogate database is the database that collects from each of the families of 
proteins in the databases a single ancestral sequence, at the point in the tree that most accurately 
approximates the root of the tree. If the root cannot be determined, the ancestral sequence chosen for the 
surrogate sequence database is near the center of mass of the tree. The second surrogate database is a 
database of the corresponding secondary structural elements. The surrogate databases are much smaller 
than the complete databases that contain the actual sequences or actual structures for each protein in the 
family, as each ancestral sequence represents many descendent proteins. Further, because there is a limited 
number of protein families on the planet, there is a limit to the size of the surrogate databases. Based on 
our work with partial sequence databases [Gonnet et al., op. cit. 1992], we expect there to be fewer than 
10,000 families as defined by steps (a) through (e). 

Searching the surrogate databases of the instant invention for homologs of a probe sequence thus 
proceeds in two steps. In the first, the probe sequence (or structure) is matched against the database of 
surrogate sequences (or structures). As there will be on the order of 10000 families of proteins as defined 
by steps (a) through (e) after all the genomes are sequenced for all of the organisms on earth, there will be 
only on the order of 10000 surrogate sequences to search. Thus, this search will be far more rapid than 
with the complete databases. A probe protein sequence (or DNA sequence in translated form) can be 
exhaustively matched [Gonnet et al., op. cit. 1992] against this surrogate database (that is, every 
subsequence of the probe sequence will be matched against every subsequence in the ancestral proteins) 
more rapidly than it could be matched against the complete database. 

Should the search yield a significant match, the probe sequence is identified as a member of one of the 
families already defined. The probe sequence is then matched with the members of this family to 
determine where it fits within the evolutionary tree defined by the family. The multiple alignment, 
evolutionary tree, predicted secondary structure and reconstructed ancestral sequences may be different 
once the new probe sequence is incorporated into the family. If so, the different multiple alignment, 
evolutionary tree, and predicted secondary structure are recorded, and the modified reconstructed ancestral 
sequence and structure are incorporated into their respective surrogate databases for future use. 
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The advantage of this data structure over those presently used is apparent. As presently organized, 
sequence and structure databases treat each entry as a distinct sequence. Each new sequence that is 
determined increases the size of the database that must be searched. The database will grow roughly 
linearly with the number of organismal genomes whose sequences are completed, and become 
increasingly more expensive to search. 

The surrogate database will not grow linearly. Most of the sequence families are already represented 
in the existing database. Addition of more sequences will therefore, in most cases, simply refine the 
ancestral sequences and associated structures. In any case, the total number of sequences and structures in 
their respective databases will not grow past ca. 10000, the estimate for the total number of sequence 
families that will be identifiable after the genomes of all organisms on earth are sequenced. If a 
dramatically new class of organism is identified, this estimate may grow, but not exponentially (as is the 
growth of the present database). 

Since Serial No. 08/914,375 was filed, other databases have emerged that offer some precomputed 
families. Most noteworthy are Pfam [BatOO] and ProDom [CorOO], 

Serial No 07/857,224 disclosed methods to identify residues, secondary structural elements, and 
evolutionary episodes that are involved in functional adaptation 

Further, during episodes of rapid sequence evolution, amino acid substitutions will be concentrated in 
secondary structural elements defined by the method claimed in Serial No. 07/857,224. These are 
secondary structural elements that are important in the acquisition of new function. A general method for 
identifying secondary structural elements that contribute to the origin of new biological function is 
comprised of identifying an element in the predicted secondary structure model where the corresponding 
section of the gene has a high ratio of expressed to silent changes. 

4. Identification of in vitro behaviors that contribute to physiological function. 

In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for example) 
that are removed from their native environment, often in pure or purified states. While isolation and 
purification of molecules and molecular aggregates from biological systems is an essential part of 
contemporary biological research, the fact that the data are obtained in a non-native environment raises 
questions concerning their physiological relevance. Properties of biological systems determined in vitro 
need not correspond to those in vivo, and properties determined in vitro need have no biological relevance 
in vivo. 

To date, there has been no simple way to say whether or not biological behaviors are important 
physiologically to a host organism. Even in those cases where a relatively strong case can be made for 
physiological relevance (for example, for enzymes that catalyze steps in primary metabolism), it has 
proven to be difficult to decide whether individual properties of that enzymes (kcat> K m , kinetic order, 
stereospecificity, etc.) have physiological relevance. Especially difficult, however, is to ascertain which 
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behaviors measures in vitro play roles in "higher" function in metazoa, including development, regulation, 
reproduction, digestion. 

A general method to determine whether a behavior measured in vitro is important to the evolution of 
new physiological function is comprised of the following steps: 

(a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the 
ancestral proteins before, during, and after the evolution of new biological function, as revealed by an 
episode of high expressed to silent ratio of substitution in a protein. This high ratio compels the 
conclusion that the protein itself serves a physiological role. 

(b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and after 
the evolution of new biological function, as revealed by an episode of high expressed to silent ratio of 
substitution. Those behaviors that increase during this episode are deduced to be important for 
physiological function. Those that do not are not. 

We now discuss using the basic evolutionary model in the context of tools that generate hypotheses 
concerning function within and between protein families. 
I. Tools that detect change in function within a family of proteins. 

A. Ratios of silent to non-silent substitution along specific branches of an evolutionary tree 
including tools that address normalization issues. 

As discussed in Serial No. 07/857,224, during the divergent evolution of two proteins from a common 
ancestor, mutations of two types accumulate. The first have no impact on the ability of the host organism 
to survive, select a mate, and reproduce; these are called "neutral" mutations. The second influence the 
behavior of the protein in a way that influences the ability of the organism to survive, select a mate, and 
reproduce. These are termed "adaptive mutations." When evolving a new function, proteins undergo an 
episode of rapid sequence evolution that corresponds to adaptive "positive selection", as is well known in 
the art [Kreitman, M, Akashi, H. Ann. Rev. Ecol. Syst. 26, 403-422 (1995)]. 

Given a basic evolutionary model for a protein family, we can begin to search for sequence details that 
are indicative of function. For example, the genetic code is degenerate. Some mutations randomly 
introduced into a genome do not alter the encoded amino acid ("silent mutations"). Others do ("non-silent 
mutations"). When the gene is under no selective pressure at all, it makes no difference to natural selection 
whether the mutation changes an amino acid or not. Thus, mutations at the level of the gene are 
(essentially) neutral, and are fixed in a population without regard to whether they are silent or non-silent. 
The ratio of non-silent to silent changes can be normalized for the number of silent sites in a particular 
sequence to give K a and K s values. 

When the function of a protein is constant, non-silent changes are usually detrimental. Non-silent 
changes are therefore removed by natural selection. Silent changes are not. The K a /K s value is therefore 
lower than unity in a protein divergently evolving under a constant set of functional constraints. Indeed, 
for many proteins with function that has been established early in natural history (such as cytochromes), 
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the ratio approaches zero. At the start of the evolutionary period where the calculation is done, the protein 
is already doing its job nearly optimally, and neither needs nor wants to change its amino acids. 
Conversely, if one reconstructs the evolutionary history of a protein, and identifies an episode in that 
evolution where the non-silent/silent ratio is very much less than one, the genomic analysis suggests that 
the protein has a conserved function during that episode. 

One of ordinary skill in the art will note that this method assumes that codon selection is not strongly 
selected in metazoa. This is not true in eubacteria, or in highly expressed genes in yeast, for example. 
However, there is little evidence in metazoa to suggest that codon usage is strongly selected in 
multicellular plants and animals (metazoa), including mammals, where most of the ORFs needing analysis 
for a developmental biology program are studied. Therefore, the presently preferred scope for methods 
involving the analysis of silent substitutions is in multicellular organisms. 

The exact opposite is the case when new function (implying, of course, new behaviors as well) is being 
engineered into a protein during an episode of evolution. Non-silent changes, those where amino acids are 
replaced at the level of the protein, are the only way to change the behavior of a protein to perform its new 
role. Natural selection desires non-silent changes, as these create new behaviors. The K a /K s value is high. 

The ratio of non-silent to silent changes, normalized for the number of non-silent and silent sites (the 
K a /K s value) was introduced in the 1980s as a way of detecting change in function between proteins at the 
leaves of trees[Li97]. It was applied to a large number of cases (for an example, see [McD91][Jol89]). 
Both the Applicant [Tra96] and Stewart and her coworkers [Mes97] extended this method to analyze 
reconstructed evolutionary events, calculating K a /K s values between ancestral nodes in an evolutionary 
tree, and applied it to individual cases (ribonuclease and lysozyme, respectively). Using this approach, if 
one reconstructs the evolutionary history of a protein, and identifies an episode in that evolution where the 
K a /K s value is greater than unity, the protein is evolving a new function during that episode. 

In practice, K a /K s values are not so easily interpretable. Even when the function of a protein is 
changing, some residues (such as those holding together the fold) cannot change without destroying the 
ability of the protein to serve as a scaffold for function. Thus, the K a /K s value for specific sites can be 
very high during an episode of divergent evolution, perhaps even much higher than unity. But because 
K a /K s values are calculated for the sequence as a whole, the sites undergoing rapid substitution are 
counted with "core" sites undergoing slow substitution, giving a K a /K s value for the protein as a whole of 
less than unity. 

Likewise, K a /K s values are assigned to individual branches of an evolutionary tree. If the evolutionary 
tree is poorly articulated, a single branch may contain both adaptive and conservative episodes of 
evolution. In this case, the high K a /K s value for the adaptive episode may be diluted by a low K a /K s value 
for the conservative episode. The second problem will, of course, subside as more and more genome 
sequence projects are completed. 
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One solution to this problem involves normalization of the Ka/K s values for a protein family. Here, the 
average K a /K s value for the average branch of the tree is calculated. Thos branches that have a K a /K s value 
an arbitrary factor higher (the presently preferred factor is two fold higher) are then hypothesized to be 
undergoing a change in function. More preferably, a statistical analysis is performed where the number of 
sites undergoing changes is determined for each branch length, the average K a /K s value is calculated, a 
statistical model is constructed to assess the distribution of K a /K s values on different branches of the tree, 
and branches that have K a /K s values lying more than two standard deviations above the mean are 
hypothesized to contain a change in function 

Serial No. 08/914,375 discussed in greater detail the tools based on the fact that the genetic code is 
degenerate. More than one triplet codon encodes the same amino acid. Therefore, a mutation in a gene can 
be either silent (not changing the encoded amino acid) or expressed (changing the encoded amino acid). 
Especially in multicellular organisms, and most particularly in multicellular animals (metazoa), silent 
changes are not under selective pressure. In contrast, expressed changes at the DNA level, by changing the 
structure of the protein that the gene encodes, change the property of the protein. 

When examining a protein from higher organisms during a period of evolutionary history where, at 
the outset of the period, the behavior of a protein is optimized for a specific biological function, and where 
that function remains constant for the protein throughout the period being examined, changes in the DNA 
sequence that lead to a change in the sequence of the encoded protein (expressed changes) will diminish 
the survival value of the protein [Benner, S. A., Ellington, A. D. Interpreting the behavior of enzymes. 
Purpose or pedigree? CRC Crit Rev, Biochem. 23, 369-426 (1988)] and therefore will be removed by 
natural selection. During the same period, silent changes will not be removed by natural selection, but will 
accumulate at an approximately clock-like rate, as silent changes are approximately neutral, especially in 
higher organisms. Thus, the ratio of expressed to silent changes will be low during a period of evolution 
of a protein family where the ancestor and its descendants share a common function. 

In contrast, in genes for proteins that are neutrally drifting without functional constraints, the 
expressed/silent ratio will reflect random introduction of point mutations. Given the genetic code and a 
typical distribution of amino acid codons within the gene, a ratio of expressed to silent changes will be 
approximately 2.5 during the period of evolution of a protein family where the ancestor and its 
descendants have no function. 

A third situation concerns a period of evolution where a protein is acquiring a new derived function. 
The amino acid sequence of the protein at the beginning of this episode will be optimized for the ancestral 
function, rather than the derived function. Thus, changes in the gene that are expressed in changes in the 
sequence of the encoded protein that improve the behavior of the protein as is required for the new 
biological function will be selected for. In proteins in such an evolutionary episode seeking new function, 
natural selection seeks expressed changes, and the ratio of expressed to silent substitutions at the DNA 
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level will be high during the period of evolution of a protein family where the function of the ancestor has 
changed with a new function emerging in its descendants. Ratios as high as 4: 1 or more are known. 

In a family of proteins defined by steps (a) through (e) above, individual periods of evolution are 
defined by lines between nodes on an evolutionary tree. In step (c), silent and expressed point mutations 
are assigned to individual periods of evolution. Periods of evolution with high ratios of expressed to silent 
mutations are episodes where physiological function is rapidly changing. Periods of evolution with low 
ratios of expressed to silent mutations are episodes where physiological function is slowly changing. 

Serial No. 08/914,375 showed the application of this approach applied to the leptin family of proteins. 
Leptins are present in mice, where they are believed to modulate feeding behavior. Leptin homologs are 
also present in humans, and the pharmaceutical industry has been excited about exploiting them in the 
treatment of obesity. The conclusion drawn from this hypothesis is that the leptin protein in humans does 
not have the same function as the leptin protein in mice. 

B. Covarion behavior, in which individual residues display different mutability in different 
branches of a tree. 

Functional changes leave signatures in the patterns of sequence evolution in a protein family. 
Covarion behavior was detected in alcohol dehydrogenase [Ben89] and superoxide dismutase [Miy95]. 
As a preliminary study in the past year, elongation factors (EF) serve as an example. These are proteins 
that have diverged far more slowly; indeed, they are archetypal examples of a protein that performs the 
"same" function in all three kingdoms of life. In the example, thirty EF-Tu/EF-la protein sequences were 
aligned over 380 sites using the alignment program DARWIN. Replacement rates per site for bacterial and 
eukaryotic EFs were estimated using a gamma-based, maximum likelihood (ML) model for protein 
sequences (JTT + r) and the phylogeny of Baldauf et al [Bal96] for EF-Tu and EF-loc. An a of 0.78 
was calculated for the entire tree, with a standard deviation (SD) of 0.05 using parametric bootstrapping 
(evolutionary simulations) [Swo96]. Interestingly, the a values for the bacterial and eukaryotic subtrees 
were significantly different from that for the entire tree [0.46 (0.04) and 0.38 (0.04), respectively]. These 
reductions in a for bacteria and eukaryotes alone are expected of a non-stationary covarion process. 

The distribution of rate differences per site between bacterial and eukaryotic EFs is leptokurtotic; i.e., 
over- and under-represented in the mean and tails versus "shoulders," respectively, relative to the 
expectations of a normal distribution. Thirty seven percent of the sites have essentially the same rate in 
the two groups (rate difference of -0), as expected under a stationary gamma process. However, 18 and 
21 sites evolve >2 SD faster in bacteria than eukaryotes, and vice versa, respectively. These 10% of the 
sites are most responsible for the covarion characteristics of EF-Tu and EF-lot. 

Residues displaying abnormal evolutionary behavior were then mapped to a three dimensional model 
of the protein based on a crystal structure of ET-Tu. These were used to generate structural hypotheses 
for the different behavioral differences that were known. For example, bacterial EF-Tu binds GDP -100 
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fold tighter than GTP. Eukaryotic EF-lot, in contrast, binds both with similar affinities. EF-Tu 
regenerates its active form by binding to the single-subunit nucleotide exchange factor EF-Ts. EF-loc 
requires the multi-subunit nucleotide exchange factor EF-1|3y5. EF-loc also interacts with the cytoskeleton 
and may thereby play a role in cellular transformation and apoptosis. EF-Tu can have no such role in 
bacteria. Residues were identified that, at the level of hypothesis, are responsible for each of these 
behavioral differences. 

Covarion behavior indicates changing function. It is therefore expected to correlate positively with 
events with high Ka/K s ratios. Because K a /K s ratios use a silent substitution clock that ticks rapidly, while 
covarion analysis does not, the two are somewhat complementary. 

C. Detecting high absolute rates of amino acid substitution, changes per unit time. 

An alternative way to detect changes in function is to measure the number of amino acids substitutions 
that occur per unit time. This requires that dates be assigned to nodes in an evolutionary tree. This can be 
done by correlation with the paleontological record, as is well known in the art. 

II. Tools that detect conservation of function within a family of proteins. 
A. Compensatory changes 

The conservation of the overall fold after extensive divergences raises the possibility that amino acid 
substitutions at one position in a polypeptide chain might be compensated by substitutions elsewhere in a 
protein. For example, if a Gly at one position inside the folded protein core is replaced by a Tip, it might 
be necessary to substitute a Trp by a Gly at a position distant in the sequence but near in space to 
conserve the overall volume of the core, and therefore the overall folded structure. These assume that if a 
substitution is not compensated, the organism hosting the protein is less fit. 

Individual examples of compensatory changes in proteins have been proposed [Oos86], both by 
analysis of families of natural proteins with known structures 

[Les80][Les82][Cho82][Alt87][Alts88][Bor90] and in proteins into which point mutations have been 
introduced by site-directed mutagenesis [Lim89][Lim92][Bal93]. In these examples, amino acid residues 
distant in the sequence but near in three dimensional space in the folded structure have been observed to 
undergo simultaneous compensatory variation to conserve overall volume, charge, or hydrophobicity. 

Compensatory covariation has been used in the prediction of the tertiary folds. For protein kinase 
[Ben91], for example, an antiparallel beta sheet was predicted for the core of the first domain because of 
two specific compensatory changes identified in consecutive strands in the predicted secondary structural 
model The subsequently determined crystal structure [Kni91] showed not only that antiparallel beta sheet 
existed, but that the side chains of the two residues undergoing compensatory covariation were indeed in 
contact 
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Systematic studies have suggested, however, that the compensatory covariation generates only a small 
signal. The early work by Lesk and Chothia with the globin family found that replacements of 
hydrophobic residues in the core of the protein fold are usually accommodated by small shifts of 
secondary structural elements rather than by size complementary amino acid substitutions 
[Les80][Les82][Cho82]. More recent studies have suggested that a weak compensatory covariation signal 
might exist [Tay94][Shi94][G6b94][Neh94]. Some authors have doubted, however, that the signal is 
adequate to be useful in structure prediction [Tay94]. Others have been more optimistic [Neh94][Shi94]. 
More recently, Chelvanayagam et aL pointed out that the signal might be improved if examples of 
compensatory covariation were sought within explicit evolutionary context [Che97][Che98]. 

In the literature, compensatory changes have been sought by comparing the sequences of two extant 
proteins from contemporary organisms. In principle, any position where an amino acid residue had 
undergone substitution at any point in the time separating the two proteins via the common ancestor might 
be paired with any other position that had also suffered substitution in this time. Such an approach is 
problematic because the evolutionary time separating two contemporary protein sequences can be long; in 
years, it is twice the time since the most recent common ancestor of the two proteins. 

A different way to detect compensatory covariation begins with the recognition that a model for the 
historical past in a protein family can be inferred from a set of homologous protein sequences These 
models have three parts: (a) an evolutionary tree, which shows the genealogical relationships between 
individual proteins in the family, (b) a multiple sequence alignment, which shows the evolutionary 
relationship between individual nucleotides in the genes encoding each family, and (c) reconstructed 
sequences of ancestral proteins that are evolutionary intermediates in the tree. Through the reconstruction 
of ancestral sequences, specific changes in a protein sequence can be assigned to (and isolated to) specific 
branches of the evolutionary tree. Within the context of a reconstructed model for the historical past, 
compensatory covariation should appear as two substitutions occurring on the same branch of the 
evolutionary tree. As these branches can be rather short in length, an analysis based on a reconstructed 
history of a protein family can identify changes that occur nearly simultaneously. These are expected to be 
true indicators of compensation. In principle, a weak compensatory covariation signal observed by the 
comparison of extant sequences should be strengthened by examining individual episodes in divergent 
evolution as reflected by specific branches in the evolutionary tree. 

In preliminary studies, we examined 71 families of proteins from the Master Catalog to learn whether 
reconstructed ancestral sequences will generate a more useful signal for compensatory covariation than 
can be obtained by examining extant sequences. We noticed anecdotally that covariation was more likely 
to occur along branches with low Ka/Ks values. This makes sense, as compensation is necessary only if 
function is conserved. Case studies developed under this project will test this. 

B. Homoplasy 
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One feature commonly observed in the divergent evolution but not modelled well by even advanced 
stochastic models is molecular homoplasy, defined as a character similarity that arose independently in 
different subfamilies of an evolutionary tree [StrOO] 

Molecular homoplasy is best illustrated by an example (Drawing 3). Homoplasy so defined is the 
observed phenomenon; no statement is made as to the mechanism by which homoplasy arises. It may 
reflect selection pressures. The Master Catalog gives us the opportunity to systematically search for 
molecular homoplasy in the database as a whole. 

At one level, homoplasy is simply the statement that selective pressures are forcing the protein to select 
from a subset of the 20 standard amino acids. Thus, it is similar to the bias that is seen in membrane 
proteins, for example (where residues are chosen more frequently from a subset of hydrophobic amino 
acids than in the database as a whole). Homoplasy is more. Not only (in the example) is position 30 
limited to A and P, but the selection pressures have toggled between the two more than once in the 
module's evolutionary history. 

This is, of course, a signature that a functional constraint is conserved in the distant branches of the 
tree protein. For this reason, molecular homoplasy is expected to be a contrarian signature to high K a /K s 
or non-stationary covarion behavior in a protein. We expect it to occur more frequently with proteins that 
are not undergoing functional recruitment. 

Some informative features are already evident from preliminary work. For example, a preliminary 
search of 38 protein families with high resolution crystal structures identified over 2000 examples of 
molecular homoplasy. These were characterized first by the nature of the amino acids identified. A 
number of very obvious patterns emerged. First, the majority of the examples involve the interchange of 
hydrophobic side chains of nearly identical volume. The homoplasy involving I and V was the most 
frequent. It occurred 230 times in the dataset. The IN molecular homoplasy was far more abundant than 
the next most popular hydrophobic/hydrophobic homoplasy, F/Y, which was found 68 times, and the I/L 
hydrophobic/hydrophobic homoplasy, which was found 44 times. As might be expected, the majority of 
these were buried in the three dimensional structure of the protein. 

In the next phase of work we will ask whether these homoplasies are correlated with homoplasies at 
other positions in the same sequence in the same branches of the trees. If the functional constraint at the 
amino acid position are sufficient to permit a protein to confer fitness only if it places one of two residues 
there, then this constraint might be sufficient to cause compensation, also possibly homoplastic, at a 
second position nearby in the folded structure of the protein. Further, it is necessary to characterize the 
branch length (NED or PAM) where the changes occur. 

The most interesting homoplasies are those that involve multiple steps. For example, the Pro/Gly 
homoplasy (at the codon level, CCN to GGN) requires two substitutions. Either of these alone creates a 
change in the encoded amino acid (CGN, Arg, or GCN, Ala). Observing examples of these without 
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observing the intermediates anywhere else in the tree suggests that selection pressure is remarkably strong 
at this position, even though two amino acids appear to be nearly equally suited to perform function. 

Molecular homoplasy indicates a constraint on structure that implies a constant behavior, which in turn 
implies a constant function. If this is true, it should correlate negatively with K a /K s ratios. That is, 
homoplasy should be found less frequently in branches separated by a branch with a high K a /K s ratio 
than in branches not separated by such a branch. Case studies developed under this project will develop 
ways to exploit such a correlation. 

C. Absolute conservation within a defined evolutionary distance 

As disclosed in Serial No. 07/857,224, residues that are conserved over an entire evolutionary tree are 
presumed (at the level of hypothesis) to be important for function, especially if they are chosen from the 
group consisting of Asp, Lys, Arg, Glu, Asn, Cys, His, Gin, Ser, and Thr. As disclosed in that application, 
however, it is important that the overall PAM width of the tree be considered before constructing 
hypotheses about the functional role of conserved residues. 

III. Tools that identify individual residues involved in changes in functionally significant 
behavior. 

In Serial No. 08/914,375, it was disclosed that during episodes of rapid sequence evolution, amino 
acid substitutions will be concentrated in secondary structural elements. These are secondary structural 
elements that are important in the acquisition of new function. These elements might be predicted using 
the method claimed in Serial No. 07/857,224; they might also be known by X-ray crystallography or 
n.rnx., for example. As n Serial No. 08/914,375, a general method for identifying secondary structural 
elements that contribute to the origin of new biological function is comprised of identifying an element in 
the predicted secondary structure model where the corresponding section of the gene has a high ratio of 
expressed to silent changes. 

In this analysis, we must recognize tthat function involves combinations of behaviors of a protein. 
Even when function changes, some features of those behaviors are conserved, and this reflects 
conservation of some features of the sequence as well. In the fumarase/aspartase/adenylosuccinate lyase 
example discussed above, all three proteins have the same overall fold. For this reason, residues critical to 
the folding process (for eample, amino acids whose side chains pack tightly into the folded core) will 
remain conserved even though the overall function of the protein is changing. Relevant to the change in 
function is, of course, a change in a number of behaviors, for example, the ability to bind a particular small 
molecule substrate. Residues involved in substrate binding will therefore be changing rapidly during the 
episode of sequence evolution where function was changing. 
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The notion that some residues are conserved even when function is chaning is matched by the notion 
that some residues will be changing even when function is conserved. The latter are those that can drift 
"neutrally". 

Likewise, "function" remains a concept set within Darwinian evolution. That is, a fumarase from a 
mesophile and a fumarase from a thermophile have analogous function in the sense that they both 
participate (for example) in the citric acid cycle. However, they have different functions, in that one 
contributes to fitness in a thermophile (which requires that it have an associated behavior, thermostability) 
while the other does not. In the epsidoe where the temperature of the environment changes, residues 
involved in conferring th ermal stability will change, while those involved in determining substrate 
specificity will not. 

Tools that assign, even at the level of hypothesis, which residues are involved in which behavior are 
extremely valuable. They can be the targets of protein engineering experiments, for example. In these 
cases, one would like to map residues identified using tools of the instant invention on to a three 
dimensional structure of a representative member of a protein family. 

Already in 1988, the Applicant was using a general form of mapping that showed the utility of this in 
extracting information about the function of a protein, in this case, alcohol dehydrogenase [Ben88 xxx]. 
More recently, Lichtarge et al. introduced an evolutionary trace method that defined functionally 
significant residues as those that are conserved within a family [O. Lichtarge, H. R. Bourne, F. E. Cohen, 
An evolutionary trace analysis defines binding surfaces common to protein families. J. Mol Biol 257, 
342-358 (1996).]. They then used this approach to identify patches on the surface of proteins that 
contribute to functionality. 

As it was published, the evolutionary trace method was related to the method disclosed in Serial No. 
07/857,224, and was applied to conserve amino acid residues. The aproach did not contemplate the 
possibility that function might change within a family of proteins, and the residues important for function 
would change with it. Indeed, to detect such changes would require tools disclosed in this application and 
in Serial No. 08/914,375 to be broadly useful. 

A. Residues changing in episodes with high Ka/Ks values, minus residues changing in 
episodes with low Ka/Ks values 

We have posited that function is changing during an episode with high Ka/K s values. As disclosed in 
Serial No. 08/914,375, individual residues can be identified as changing during that episode, as the basic 
evolutionary model has sequences reconstructed at each individual node. These are, at the level of 
hypothesis, residues that are important to functional change. 

As one of ordinary skill in the art recognizes, the episode also includes a number of substitutions that 
have no relevance to function or the change in function, but rather reflect the background, neutral drift. For 
example, these residues might lie on the surface of the protein, be in contact with bulk solvent, and not 
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have any especially strong functonal constraint that prevents them from diverging. As disclosed in Serial 
No. 07/857,224, surface residues are likely to be neutrally drifmg in many sub-families within an 
evolutionary tree. For this reason, we can identify residues that are changing along branches of an 
evolutionary tree that have low Ka/K s values, and subtract them from residues changing in episodes with 
high K a /K s values. What remains are residues more likely, again at the level of hypothesis, to be involved 
in the change in function. 

Serial No. 07/857,224 disclosed and claimed methods for correlating changes in sequence with 
changes in the behavior of the protein. This in turn provides a method for identifying behavioral changes 
that are relevant to the change in function. 

B. Residues displaying covarion behavior 

Again because the basic evolutionary model includes reconstructed ancestral intermediates, the 
methods of the instant invention identify specific residues that are displaying covarion behavior. These are 
residues that are under analogous functional constraints in different sub-families of the tree. This, in turn, 
implies that these particular residues contribute to a behavior that is conserved for a conserved feature of 
the function in distant branches of the tree. 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

Insight into the relationship between function and amino acid sequence can be gained by mapping 
residues identified by K a /K s and covarion analysis onto a three dimensional structure. This identifies, for 
any particular branch, which residues are involved in changing function. This information is useful when 
attempting to identify residues that might be changed in a protein engineering experiment, for example. 

IV. Tools that identify individual residues involved in conserved of functionally significant 
behavior 

The type of analysis used for class HI tools can also be applied to class IV tools. 
A. Residues suffering compensatory changes 

When a pair of residues suffers compensatory changes during a particular episode of protein 
sequence evolution, this implies that some physical property of the protein family must be the same at the 
end of the episode as it was at the beginning. This implies some conserved behavior important across that 
episode. The episode can, of course, be one where function in some sense is changing. Thus, in the 
fumarase/aspartase example outlined above, one might identify residues the suffer compensatory changes 
during episodes where catalytic behavior is changing. These are residues most likely (at the level of 
hypothesis) to be important for folding, which is conserved over this episode. We can therefore use the 
methods of the instant invention to identify individual residues involved in conserved of functionally 
significant behavior 
B* Residues displaying homoplasy 
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Positions that display homoplasy are subject to analogous functional constraints in different branches 
of the tree. Because of the evolutionary reconstructions in the basic evolutionary model, we know which 
positions they are are which amino acids involved. Therefore, we use the methods of the instant invention 
to identify individual residues involved in conserved of functionally significant behavior 
C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 

proteins. 

Insight into the relationship between function and amino acid sequence can be gained by mapping 
residues identified by K a /K s and covarion analysis onto a three dimensional structure. This identifies, for 
any particular branch, which residues are involved in changing function. This information is useful when 
attempting to identify residues that might be changed in a protein engineering experiment, for example. 

V. Tools that involve correlation between the evolutionary histories of two families of proteins 

Serial No. 07/857,224 introduced in the first useful form the notion of compensatory changes as a 
way of analyzing divergent evolution in protein sequences. In that application, an example of 
compensatory covariation was identified that indicated the packing of two beta strands in an antiparallel 
fashion. A second use for compensatory changes disclosed was as part of a tool to detect disulfide bonds 
in a protein; cysteines that arise and/or disappear at the same time during the divergent evolution of a 
protein family frequently form a disulfide bond with each other. Serial No. 08/914,375 extended this 
notion, noting that the introduction and loss of leptin and the leptin receptor might occur in parallel. The 
idea behind this analysis is that residues that interact as they contribute to function, subunits that interact 
as they contribute to function, and even proteins that interact as they contribute to function, display 
correlated evolution. 

Since these applications were filed, various other groups have extended this approach. We review 
briefly two of the areas where research is active, and make comments on why additional invention is 
necessary to make these approaches fully useful 

A. Correlating the topology of evolutionary trees in two families of proteins 

Recently, Pellegrini et al. extended this type of analysis to generate "protein phylogenetic profiles" for 
different organisms [Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., Yeates, T. O. 
Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles PNAS 96, 
4285-4288 1999]. They present a method that assumed that during evolution, proteins that function 
together tend to be either preserved or eliminated in a new species. They described this property of 
correlated evolution by characterizing each protein by its phylogenetic profile, a string that encodes the 
presence or absence of a protein in every known genome. They suggested that proteins having matching 
or similar profiles strongly tend to be functionally linked. This method of phylogenetic profiling allows us 
to predict the function of uncharacterized proteins. 
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More recently, Cohen and his coworkers used phosphoglycerate kinase (PGK), an enzyme that forms 
its active site between its two domains, to develop a standard for measuring the co-evolution of interacting 
proteins. The N-terminal and C-terminal domains of PGK form the active site at their interface and are 
covalently linked. Therefore, they must have co-evolved to preserve enzyme function. By building two 
phylogenetic trees from multiple sequence alignments of each of the two domains of PGK, they calculated 
a correlation coefficient for the two trees that quantifies the co-evolution of the two domains. The 
correlation coefficient for the trees of the two domains of PGK is 0.79, which establishes an upper bound 
for the co-evolution of a protein domain with its binding partner. Their analysis was extended to ligands 
and their receptors, using the chemokines as a model [Goh, C.-S., Bogan, A. A., Joachimiak, M., Walther, 
D., Cohen, F. W. (2000) Co-evolution of Proteins with their Interaction Partners. 7. Mol Biol xxx, 283- 
293. 

We have no quarrel with either of these approaches; indeed, they are in some ways covered by the 
Applicant's earlier disclosures. It should be recognized, however, that these simple approaches that exploit 
evolutionary analysis are easily defeated by the "ortholog paralog problem", especially when it is coupled 
with gene loss. Briefly, paralogs are generated when a gene duplication occurs internally within a genome, 
to create two homologous genes in the same organism. 

B. Correlating the connectivity of proteins in a gene family 

Eisenberg and his coworkers. Enright [AJ] et al., and others have also suggested that proteins that 
interact in a pathway might be connected physically in the genome, either as an operon or, in some cases, 
in a single expressed polypeptide chain. This interesting approach is applicable to only a subset of the 
database, and is distinct from the tools disclosed here. [Marcotte, E.M., M. Pellegrini, H.L. Ng, D.W. 
Rice, T.O. Yeates, and D. Eisenberg. 1999. Detecting protein function and protein-protein interactions 
from genome sequences. Science. 285: 751-753] 

C. Dating events in the molecular history 

A key element to using evolutionary analysis of correlated change in protein families is to establish that 
the changes being interpreted as evidnce that two proteins interact as they function is to show that the 
changes are contemporaneous, that is, they occur near the same time. This requires tools that date, if only 
approximately, events in the molecular evolutionary tree using sequence data. 

Early hope that protein sequences might change in a "clock-like" fashion [Can82], with a small number 
of rate constants describing the rate of change at most positions in most proteins in most organisms, has 
given way to the reality that the evolution of protein sequences is marked by episodes of rapid and slow 
evolution [Mes97]. These correspond to changing and conserved function within the protein family, 
arising in turn from adaptive and purifying natural selection, respectively. This makes methods based on 
protein sequence divergence unreliable for dating the divergence of protein sequences. 

One well known approach to avoid (to a large extent, at least in metzoans) the influence of purifying 
and adaptive selection on the interpretation of molecular history is to examine changes in non-coding 
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regions of DNA [Li97]. These include introns and substitutions, generally at the third position of a codon, 
that do not change the encoded amino acid. These arise because the genetic code is redundant for many 
amino acids. This approach assumes that silent substitutions at the DNA level have little or no impact on 
fitness (are neutral or nearly neutral) at the level of the organism. While this is almost certainly not a good 
approximation in microorganisms, the approximation appears to be serviceable for metazoans 
(multicellular animals) and plants, presumably because macrophysiology is more visible to selective forces 
than genome sequence itself in multicellular organisms. 

Even silent substitutions are problematic as a molecular clock, however. From a chemical perspective, 
interconverting the four standard nucleobases A, G, T, and G involves 12 rate constants that need not be 
identical [Nei86]. Some models distinguish between transitions (purines replaced by purines, or 
pyrimidines replaced by pyrimidines) and transversions (purines replaced by pyrimidines, or pyrimidines 
replaced by purines), but otherwise group the rate processes together. This problem is revisited frequently 
in the literature [Nei86]. The most widely used method was developed by Li [Li85] with modifications by 
Pamilo and Bianchi [Pam93]. This method aggregates four fold redundant and two fold redundant sites, 
analyzes nucleotide substitution at positions where the encoded amino acid has not changed at the same 
time as it analyzes substitution at positions where the encoded amino acid has changed, and adopts a 
classification of different types of substitutions based on physical chemical characteristics of amino acids. 

Disclosed here for the first time, the Applicant has discovered good part of the inconsistency in the 
dating generated by these methods can be eliminated if one focuses on relatively homogeneous chemical 
processes. In particular, transitions accumulate over large periods of (for example) vertebrate history with 
remarkable constancy, with a pseudo first order rate constant of 3.0 x 10 -9 changes/base/year. A tool 
based on this discovery begins by extracting aligned pairs of codons from a pairwise alignment where two 
fold redundant amino acids (CDEFHKNQ Y) are conserved. Substitution at the silent position is then 
modelled using an exponential "approach to equilibrium" rate law, where £2 is the fraction of the codons 
encoding conserved 2FR amino acids that are themselves conserved: f2 = [0.5 # exp(-£r)] + 0.5, where k is 
a single pseudo first order rate constant for transitions, and t is the time. The neutral evolutionary distance 
(NED ) between two genes x and y is defined by NED^-y- kt x ,y = 4n[(f2 X jrh0.5)/0.5]. 

NEDs represent one choice in a trade-off, between the instinct of a statistician (to maximize the number 
of characters being examined, and hence minimize error due to fluctuation) and the instinct of an organic 
chemist (to seek homogeneous rate processes, and hence minimize systematic error due to aggregation of 
different kinds of events). 

The NED is a measure of evolutionary distance, not evolutionary time. If one knows the rate constant, 
and assumes that k is constant over the period of evolutionary history being examined, one can calculate 
the time of divergence. Given the same assumption and the date of evolutionary divergence of two 
sequences, one can calculate k. As distances, NEDs are additive, should obey the triangle inequality, and 
display other features that permit them to be used to build evolutionary trees. 



29 



Steven A. Benner 



The transition-based two fold NED turned out to be remarkably robust measures of evolutionary time. 
When calibrated using datable fossil divergences back to the divergence of fish from land vertebrates, a 
single lineage rate constant of 3 x 10 -9 changes per base per year was obtained in many of the cases we 
examined, applicable (within error) to the divergence of fish from mammals, reptiles and birds from 
mammals, primates from artiodactyls, and artiodactyl genera from other artiodactyl genera. NEDs built 
from four fold redundant systems were far less consistent. 

One of the key issues in the development of evolutionary models is assigning ranges of geological 
dates to nodes in the tree. Early hope that protein sequences might change in a "clock-like" fashion, with a 
small number of rate constants describing the rate of most amino acid substitutions in most proteins in 
most organisms, has given way to the reality that the evolution of protein sequences is marked by episodes 
of rapid and slow evolution. These correspond to changing and conserved function within the protein 
family, arising from adaptive and purifying natural selection respectively. This makes protein sequence 
similarity (for example, point accepted mutations per 100 amino acids, or PAM units) unreliable for dating 
the divergence of protein sequences. 

One well known approach to avoid the influence of purifying and adaptive selection on the 
interpretation of molecular history is to examine changes in non-coding regions of DNA. These include 
introns and substitutions, generally at the third position of a codon, that do not change the encoded amino 
acid. These arise because the genetic code is redundant for many amino acids. Amino acids encoded by 
four synonymous codons (A4S) are valine, alanine, threonine, proline and glycine. Amino acids encoded 
by two synonymous codons (A^s) are cysteine, aspartic acid, glutamic acid, phenylalanine, histidine, 
lysine, asparagine, glutamine, and tyrosine. One amino acid (isoleucine) is encoded by three synonymous 
codons (A3 ! s). These patterns are found in the eukaryotic nuclear code; other codes exist, of course. 

This approach has a chance of working if silent substitutions at the DNA level have little or no impact 
on fitness at the level of the organism. While this is almost certainly not a good approximation in 
microorganisms (at least for some codons in highly expressed genes), the approximation appears to be 
serviceable for metazoans (multicellular animals), presumably because redundant codon exchange does 
not change the structure or the behavior of any functioning protein, and the structure and behavior of 
functioning proteins, together with the consequent macrophysiology, is more visible to selective forces 
than genome sequence itself. The approach is now empirically shown to be reliable within chordates. 

Even silent substitutions are problematic as a molecular clock, however. From a chemical perspective, 
interconversion of the four standard nucleobases A, G, T, and G involves 12 rate constants that need not 
be identical (there is a large literature on this; see for example [Nei86]). Simpler models have distinguish 
between transitions (purines replaced by purines, or pyrimidines replaced by pyrimidines) and 
transversions (purines replaced by pyrimidines, or pyrimidines replaced by purines), but otherwise 
grouped the rate processes together. 
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This problem has been revisited frequently in the literature. The most widely used method (indeed, the 
one implemented in the present version of the Master Catalog when assigning K a /K s values, following 
some adaptations that we made, Schreiber, Benner unpublished) was developed by Li [Li85] with 
modifications by Pamilo and Bianchi [Pam93] following a suggestion by Kimura. 

In the previous funding period, we developed and tested a NEDs as a tool for dating sequence 
divergences Table 1). NEDs turned out to be remarkably robust measures of evolutionary time. When 
calibrated using datable fossil divergences back to the divergence of fish from land vertebrates, a single 
lineage rate constant of 3 x 10 -9 changes per base per year was obtained in many of the cases we 
examined, applicable (within error) to the divergence of fish from mammals, reptiles and birds from 
mammals, primates from artiodactyls, and artiodactyl genera from other artiodactyl genera. Statistical 
analysis suggests that >80% of the variance arises from simple statistical fluctuation. This suggests the 
absence of "hot spots" and other non-stochastic variation at the 2-fold degenerate sites in the genome. 
Again, relatively expensive tools (such as full blown ML tools) gave insignificantly different results than 
relatively cheap tools (such as the Pamilio-Bianchi approach) in a series of test cased that were applied in 
parallel. 

Table. Average NED values for Pairs of Proteins Extracted from Humans, Pigs, Oxen, Rabbit, Rat, and Mouse 



Species 1 


Species 2 


Number 


kt (range) 


Date 


k (calc.) 


k (average) 




of pairs 


(NED) 


(fossil) 


xl09 


xl()9 










MYA 


changes/base/year 


Human 


Pig 


225 


03990 


80 


2.5 




Human 


Ox 


410 


0.3800 


80 


2.4 


2.4 


Pig 


Ox 


140 


0.2755 


60 


2.3 




Rabbit 


Human 


203 


0.4845 


80 


3.0 




Rat 


Human 


584 


0.4893 


80 


3.0 


3.1 


Mouse 


Ox 


147 


0.5130 


80 


3.2 




Mouse 


Human 


918 


0.4988 


80 


3.1 




Mouse 


Rabbit 


87 


0.5083 


60 


4.2 


5.2 


Mouse 


Rat 


926 


0.2470 


20 


6.2 





D. Correlating evolutionary events in two protein families occuring at approximately the same 
time 

Given approximate dates, we can now provide a more useful tool to correlate events occurring in two 
trees. A duplication in family 1 that is occurring near the time as a duplication occurring in family 2 is 
hypothesized to indicate that the two families (and, in particular, the proteins arising from the duplication) 
interact when they function. Conversely, and frequently quite usefully, a duplication in family 1 that did 
not occur near the time as a duplication occurring in family 2 is hypothesized to indicate that the two 
proteins arising from the duplication do not interact when they function. These hypotheses are ueful when 
designing two-hybrid systems, for example, to detect protein-protein contacts. 
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E. Correlating evolutionary events in two protein families that are associated with analogous 
behavior involving expressed/silent ratios 

When there is a duplication, the question arises: Which of the derived genes is performing the derived 
function, and which is performing the ancestral function? According to the method of this invention, the 
derived protein is the one connected to the node where the duplication has occurred via the higher K a /K s 
value. This concept supports a useful tool to correlate events occurring in two trees. A duplication in 
family 1 that is occurring near the time as a duplication occurring in family 2 is hypothesized to indicate 
that the proteins arising from the duplication from the branch having the higher K a /K s value in one tree 
interact when they function with the proteins arising from the duplication from the branch having the 
higher K a /K s value in one tree interact when they function with the. Conversely, and frequently quite 
usefully, when examining two contemporarneous duplication events in two separate families, the proteins 
in family 1 that do not interact with the proteins in family 2 are those that are not joined to their respective 
nodes via branches that display, during contemporaneous periods of evolution, high K a /K s values. 

As one of ordinary skill in the art will appreciate, this approach is quite general, and can be applied 
with covarion behavior, compensatory substitution, homoplasy, and even levels of high sequence 
conservation. 

VI. Tools that involve correlation between the evolutionary history of a family of proteins and 
the evolutionary history of the organism as known from some source other than genomic 
sequence data, including paleontology, geology, ecology, ontogeny, phylogeny, or systematics 
(collectively known as the "non-genomic record". 

The methods of this invention extract information about function and function change by analyzing 
sequence data alone, and then by coupling this analysis with secondary, tertiary, and quaternary structural 
data. Those of ordinary skill in the art know, of course, of other sources of evoluionary information that 
does not come from genomic sequence data or crystal structures. These "non-genomic" data come from 
paleontology, geology, ecology, ontogeny, phylogeny, and systematics (collectively known as the "non- 
genomic record"). 

A. Correlating the topology of an evolutionary trees and the non-genomic record. 

Conversely, and quite usefully, when a node in an evolutionary tree 

Dates can be obtained approximately by protein sequence analysis. In cases where silent substitutions 
have not equilibrated, NED distances or other distances based on the analysis of silent codon substitutions 
can be used. 

As discussed above, detailed analyses of evolutionary histories frequently can provide a solution to the 
most general problem of the conventional evolutionary paradigm, the difficulty in routinely identifying a 
homolog of a target sequence with known function within the database. By analysis of non-Markovian 
evolutionary behavior at the level of the protein, a model of secondary structure can be predicted. This 
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prediction can be used in turn to detect long distance homologs in some cases and exclude the possibility 
of distant homology in others. This increases the likelihood that a homolog will be found with a known 
structure, behavior, or function for a new protein sequence. If one is found, then the logic associated with 
the conventional evolutionary paradigm can be applied to generate a hypothesis concerning the behavior or 
function of the protein. 

The value of this post-genomic tool to assign behavior and structure to a target sequence problem is 
expected to grow over the near term, as the ratio of sequences supported by experimental studies to those 
not supported increases with the conclusion of genome projects, and as more sequences increase the detail 
of the evolutionary histories that can be extracted from the database directly, and therefore the quality of 
the predicted secondary structural model 

At the next level, analysis of non-Markovian behavior at the level of the gene can alert the biological 
chemist that the logic associated with the conventional evolutionary paradigm might not apply in individual 
cases. In particular, if an episode of rapid sequence evolution intervenes in the evolutionary tree between 
the sequence of interest and the sequence with the know behavior and function, the biological chemist is 
alerted to the possibility that the function of the protein might have changed. This alert is useful even with 
close homologs, as illustrated in the example with leptin. 

But what if the evolutionary tree contains no protein with a sequence with assigned function, even one 
with low sequence similarity? Even with more limited evolutionary histories, post-genomic tools that 
analyze non-Markovian evolution at the level of the codon can be useful By identifying the organisms 
that provide the sequences at the "leaves" of the evolutionary tree, it is frequently possible to correlate 
branches in the evolutionary tree with episodes in geological history, as determined from the fossil record. 
Especially in multicellular animals (metazoa), the fossil record can provide approximate dates for the 
emergence of new physiological function. In this case, it is possible to ask whether an episode of rapid 
sequence evolution in a protein family (in particular, an episode with a high expressed/silent ratio) 
occurred at the same time as a new physiological function emerged on earth. If so, a first level of 
hypothesis about physiological function can be proposed, even if no behavior or function of any kind is 
known for any of the modern proteins. 

Perhaps the most transparent analysis of this type concerns proteins that underwent massive radiative 
divergences in metazoa approximately 600 million years ago. This is the time of the Cambrian explosion, 
an episode in terrestrial history that marks the massive radiative divergence of multicellular animals, 
including chordates. Proteins families undergoing rapid evolution at this time (for example, of protein 
tyrosine kinases and src homology 2 domains) are almost certainly involved in the basic processes by 
which multicellular animals develop from a single fertilized egg. 

This type of analysis might be applied in the family of ribonuclease (RNase) A (E.C.2.7.7.16), a well 
known family of digestive proteins found in ruminants. The protein underwent rapid sequence evolution 
approximately 45 million years ago, a time where ruminant digestion emerged in mammals [T. M. 
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JERMANN, J. G. OPITZ, J. STACKHOUSE, J. and S. A. BENNER, Reconstructing the evolutionary 
history of the artiodactyl ribo nuclease superfamily. Nature 374, 57-59 (1995).]. Thus, the rapid 
molecular evolution evident in the reconstructed evolutionary history of this protein suggests that the 
protein is important for ruminant digestive function. 

B. Correlating features of patterns of evolution in specific branches in the evolutionary tree 
with the non-genomic record 

This type of analysis is obviously strengthened if one adds now information concerning Ka/Ks values, 
covarion behavior, homoplasy, and compensatory changes. 

C* Correlating evolutionary events in several protein families occuring at approximately the 
same time with the non-genomic record 

This type of analysis can obviously contribute to the determination of pathways, interactions between 
proteins from different families. These hypotheses are ueful when designing two-hybrid systems, for 
example, to detect protein-protein contacts. 

Use of non-stochastic behavior generally 

One of ordinary skill in the art will recognize from Serial No. 07/857,224 that the methods of the 
instant invention view molecular evolution in a way quite distinct from the way in which standard tools 
analyze protein sequence data. Virtually all tools for comparing the sequences of homologous proteins 
assume a model for divergent evolution that is stochastic in outcome. This model treats a protein sequence 
as a linear string of letters, one letter for each amino acid. According to the model, each letter in the string 
changes (the gene and its corresponding protein mutates) at a rate that is independent of its position. 
According to the stochastic model, future and past mutations are independent. Mutations at one position 
are independent of mutations elsewhere. 

Such a model is at best an approximation for the reality of protein evolution. In reality, proteins are not 
linear strings of letters. Rather, they are organic molecules that fold in three dimensions. In the folded 
form, some positions in a protein sequence are more easily mutatable (without destroying function) than 
others. Amino acids distant in the sequence but close in the fold frequently undergo correlated mutation. 
Future mutations are frequently not independent of past mutations. Thus, real proteins divergently 
evolving under functional constraints behave differently than expected based on the stochastic model 

The difference between the reality of divergent evolution of proteins that fold and expectation based on 
the stochastic model proves to be important, as was disclosed first in Serial No. 07/857,224. By 
comparing the patterns of substitution within a set of folded proteins undergoing divergent evolution with 
expectations for those patterns based on the stochastic model, one can extract information about the fold. 
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This makes the nuclear family more than a database organizational feature. Because the nuclear family 
holds a history of the pattern of divergent evolution under functional constraints in the protein, it holds 
information about the fold of the protein. From the sequences of proteins in the nuclear family alone, one 
can decide which amino acids lie on the surface of the folded structure, which lie inside, and which lie near 
the active site. Elements of secondary structure, the helices, strands, and loops can be identified. A model 
of tertiary structure can be built as well, all from the evolutionary history embodied in the nuclear family. 

EXAMPLES 

Example 1. Functional analysis of aromatase 

Aromatase is a cytochrome P450-dependent enzyme that catalyzes a three step reaction that creates 
an estrogen from an androgen. The physiological consequences of estrogen biosynthesis in human 
biology are well known, even among laymen. Estrogen is also synthesized in primitive chordates such as 
Amphioxus (Callard et al, 1984), but not in other metazoans. Therefore, estrogen appears to have been 
invented as a hormone early in the divergent evolution of chordates, presumably by recruitment of steroids 
involved in developmental biology in more primitive metazoan ancestors. 

Aromatase belongs to the cytochrome P450 superfamily of enzymes, which has some two dozen 
family members (Nebert et al., 1991). Members of the superfamily use a common chemical mechanism 
(Akhtar et al, 1997) to assimilate carbon, detoxify organic substances, and synthesize regulatory 
molecules. In biomedicine, variants of P450 oxidases can determine whether individuals have side effects 
to a therapeutic agent (Gonzalez & Nebert, 1990), and aromatase itself plays a significant role in the 
progression of some cancers. 

Recent research has found remarkable complexity in the molecular biology of the aromatase gene 
family. Two aromatase genes are known in goldfish (Callard and Tchoudakova, 1997). In contrast, only a 
single gene is known in the horse (Boerboom et al., 1997), the rat (Hickey et al., 1990), the mouse 
(Terashima et al., 1991), the human (Harada, 1988), and the rabbit (Delarae et al, 1996). Both a functional 
gene and a pseudogene are found in oxen. The pseudogene is built from homologs of exons 2, 3, 5, 8, and 
9 interspersed with a bovine repeat element (FiirbaB & Vanselow, 1995); it is transcribed but not 
translated. In several mammalian species, a single gene yields multiple forms of the mRNA for aromatase 
in different tissues via alternative splicing mechanisms. This is the case in humans (Simpson et al., 1997) 
and rabbits (Delarue et al. 1998). 

A still different phenomenology is observed in the pig (Sus scrofa). Preliminary studies found three 
distinct mRNA molecules in different tissues with differences in their coding regions (Conley et al. 1996; 
Conley et al. 1997; Choi et al., 1996; Choi et al., 1997a; Choi et al., 1997b). It was suggested that these 
might have arisen from a single gene, possibly via RNA editing or alternative splicing (Conley et al. 
1997). 
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Analogous collections of phenomenology are found throughout contemporary molecular biology 
for many molecular systems. "Why?" questions are often confounded by the complexity of the 
phenomenology. When "just so" stories are proposed, they need not be compelling, especially when they 
are supported by no evidence past the phenomena themselves. 

One approach to obtain additional evidence to address functional questions in systems requires 
placing the molecular biological phenomena within an evolutionary context. To do this for the aromatases 
family, we began with experiments to determine whether the three mRNA isoforms (and the 
corresponding proteins) in pig arose through alternative splicing, via mRNA editing, or from distinct 
genes. PCR primers were designed from sequences located within the previously characterized exon 4 of 
the porcine aromatase type m gene (Choi et aL, 1996, 1997a), a region that the cDNA studies suggested 
might have internal sequence differences (Choi et aL, 1997a; Conley et aL, 1997) and used to amplify pig 
genomic DNA. Initially, eight clones of the PCR products were sequenced. Four of these had the 
sequence corresponding to aromatase isoform I (ovarian type) as identified from cDNA, while four others 
had the sequence corresponding to aromatase isoform in (embryo type) as identified from cDNA. 

With evidence that at least two aromatase genes could be found in pig genomic DNA, a restriction 
enzyme-based assay was designed to search genomic DNA in greater detail. Nsi I digests exon 4 from 
isoform I twice, and isoform III once. Bsm I digests exon 4 from isoform I once, but not exon 4 of 
isoform EL Exon 4 from isoform II (placental type) had no restriction sites for either enzyme. Restriction 
analysis of a total of 23 clones obtained from genomic DNA identified 8, 5, and 10 representatives of 
isoforms I, n, and HI, respectively. No restriction digestion patterns indicative of a novel sequence were 
observed. Representative clones for isoforms I, n, and IE were then sequenced. To further confirm the 
presence of exactly three aromatase isoforms within the porcine genome, primer pairs were designed from 
within the 5* and 3' junctions of exon 7. Sequence analysis of 10 clones derived from the PCR products 
identified six and four clones of isoforms II and HI, respectively 

With compelling evidence that the three variants of mRNA identified in cDNA studies arose from 
three paralogous genes (as opposed to editing or alternative splicing), we sought to place the paralogous 
genes within their historical context. Following standard tools to analyze protein sequences, pairwise 
alignments were constructed for the 136 pairs of proteins. An evolutionary distance (in PAM units) was 
calculated (with a variance) for each pair (Table 1). From this, an evolutionary tree was built for the 
mammalian sequences (Drawing 4), with branch lengths along internal nodes calculated to minimize a 
least squares distance were then constructed within the Darwin programming environment. The tree was 
adjusted to make the human and equine branchings consistent with paleontological records to obtain a 
"best consensus" tree. The sequences of the ancestral genes and proteins at branch points in the tree were 
then reconstructed. From there, mutations (including fractional mutations) at both the DNA level and 
protein level were assigned to individual branches in the tree using the method of Fitch (1971). 
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Based on the tree and the reconstructed evolutionary intermediates, K a /K s values were assigned to 
individual branches using the method of Li et al. (1985). These reflect the normalized ratio of 
substitutions at the level of the gene that change the encoded polypeptide sequence (non-synonymous 
substitutions) to substitutions at the level of the gene that do not change the encoded polypeptide sequence 
(synonymous substitutions). Lower K a /K s values generally reflect conservative episodes of evolution 
where function remains constant, while higher values frequently characterize episodes of evolution where 
function is changing (Trabesinger-Ruef et al., 1996; Messier & Stewart, 1997). 

The average branch in the aromatase evolutionary tree has a value of K a /K s of 0.348. Inspection of 
the tree shows that the highest K a /K s values anywhere in the mammalian aromatase family (0.85 and 0.66) 
are found within the divergent evolution of the pig aromatases. These suggest that adaptive changes 
occurred during the triplication of the aromatase gene in pigs. Adaptive changes are well known to 
confuse simple models of molecular history built from standard sequence alignment and tree construction 
tools. Adaptive substitutions do not conform to stochastic rules modelling divergent evolution (Benner et 
al, 1997), do not accumulate in a clock-like fashion, and may arise through convergent and parallel 
evolution (Stewart et al., 1987). 

Therefore, the evolutionary history of the aromatase family was re-analyzed using pairwise Neutral 
Evolutionary Distances (NEDs) (Liberies et al, 1999), obtained for the 136 pairs of aligned aromatase 
genes (Table 2). To estimate NEDs between the aromatase gene pairs, the number (n) of "2-fold 
redundant amino acids" (Cys, Asp, Glu, Phe, His, Lys, Asn, Gin, and Tyr) that are conserved in the aligned 
pairs was determined. The number of those amino acids that are encoded by the same codon (c) was then 
determined, and the fraction (£2 = clri) of the codons that are the same is then tabulated (Table 2). 

A variety of empirical studies show that the fixation of silent substitutions in conserved 2-fold 
redundant codon systems follows rate law that is a simple exponential "approach to equilibrium" f2 = 
[0.5»exp(-fo)] + 0.5, where A: is a single pseudo first order rate constant for transitions, and t is the time 
(Jukes & Cantor, 1969). The NED distance is defined by NED XJ = kt x , y = ln[(f2 XJ +0.5)/0.5]. 

The NED is a measure of evolutionary distance, not of evolutionary time. As distances, NEDs are 
additive, should obey the triangle inequality, and display other features that permit them to be used to build 
evolutionary trees, provided that k is constant over the period of evolutionary history being examined. A 
variety of empirical studies shows this to be approximately the case for many protein families. The 
approximation appears to be quite good for aromatase as well. Thus, if a fixed single lineage first order 
rate constant of 3 x 10~ 9 changes per base per year is assumed, the NED values indicate that fish and land 
vertebrates diverged 340 million years ago (mya), birds and mammals diverged 250 mya, primates and 
ungulates diverged 73 mya, horse and artiodactyls diverged 71 mya, and pigs and ruminants diverged 62 
mya. Each of these dates is close to the date suggested by the paleontological record (Carroll, 1988). 

The NED-based dating was used to assess two alternative models to explain the triplication of 
aromatase gene family in pigs. The first, advanced by Callard and Tchoudakova (1997), holds that the 
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physiological specialization of aromatases through the formation of paralogs occurred early in vertebrate 
divergence, perhaps 400 mya, before fish and mammals diverged. If this were the case, then a functional 
explanation for the aromatase genes must be sought in fundamental features of vertebrate developmental 
biology, those that emerged early in vertebrate evolution. Conversely, the triplication of aromatase may 
occur in response to the domestication of pigs. In this case, a functional explanation for the aromatase 
genes would be found in the selective pressures applied by breeding programs. 

The NEDs separating the three pig isoforms range from 0. 154 (corresponding to a distance of 5 1 
million years between the proteins) to 0.199 (corresponding to a distance of 66 million years). 
Recognizing that the total distances between two proteins are twice the distance along a single lineage 
from the point of divergence to the modern protein (half of the distance occurrs along one lineage after 
divergence, and half of the distance occurs along the other lineage), the NEDs suggest that the first 
duplication led to the three porcine aromatase genes occurred ca. 33 mya, and the second occurred ca. 25 
mya. An evolutionary tree constructed from these NEDs is consistent with these conclusions, showing 
that the porcine aromatases branched after the lineage leading to pig diverged from the lineage leading to 
ox (Drawing 5). This tree shows a different branching order for the three porcine paralogs than the tree 
based on amino acid sequences, something not uncommon in the presence of substantial adaptive 
evolution. Nevertheless, the data are consistent with an evolutionary model that holds that the ancestor of 
pig and oxen (approximated in the fossil record most closely by the now extinct Diacodexis which lived 
perhaps 55 mya) contained a single aromatase gene, and that the paralogous genes in pig arose ca. 25 
million years later. Thus, the paralogs in pig can be explained neither in terms of the fundamentals of 
vertebrate development, nor as a consequence of swine domestication. 

Error in these dates can arise from two sources, standard error (which arises from fluctuation) and 
systematic error (which arises from the fact that the evolutionary model does not represent actual 
evolution). The first can be calculated by standard statistical approaches using standard statistical 
assumptions. The second cannot be calculated, as too little is known about possible systematic errors in 
the evolutionary model. The £2 distances are each based on ca. 120 two-fold redundant codon systems, 
and variances for the NEDs are given in Table 2. Inspection of the tree in Drawing 5 gives an indication of 
the actual error, as the NED between any ancestral sequences and all modern sequences derived from it 
should be the same. The calculated distance from the divergence of the three porcine enzymes to the type 
II enzyme is 3 1 million years, to isoform I is 32 million years, and to isoform m is 30 million years. 
Thus, the average reported (3 1 mya) could be as low as 30 and as high as 32 mya. All of these dates are in 
the Oligocene, after the first episode of cooling. The divergence of isoform I and IE ranges from 24-26 
mya. These apparent errors are less than the errors associated with the dating (from the fossil record) used 
to set the molecular clock. 

Instead, an understanding of why pigs have three genes for aromatase must lie in the environment of 
(and events that occurred during) a time on Earth 25-33 mya. For this we turn to the paleontological, 
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paleogeographical, and paleoclimatological records of that period, which is near the boundary between the 
Oligocene (38-25 mya) and the Miocene (25-5 mya), two epochs in the Cenozoic "Age of Mammals" 
(Prothero, 1994). This period is an unusual one in the history of the Earth. When characterized globally, 
the Earth during the Eocene (54 - 38 mya) was warm and tropical, evidently free of ice over the entire 
planet. By the end of the Eocene, however, the Earth had begun to suffer a dramatic cooling that was to 
lower the mean annual temperature by as much as 15 °C (Wolfe, 1978). Areas of the planet became 
covered with ice. And the impact of the cooling on the biosphere was dramatic. For example, perhaps 80% 
of the North American faunal genera became extinct (Prothero pp 113-1 14; Stucky, 1990). By the end of 
the Oligocene and into the Miocene 25 mya, however, the global cooling abated, the climate turned 
warmer, and the biosphere became more tropical. 

Did this climate change occur in the environment where the ancestors of modern pigs were living 
just before the Oligocene-Miocene boundary? At this time, the North American and Eurasian fauna were 
geographically isolated. Modern peccaries {Tayassuidae), not pigs, emerged in the New World from 
ancestral suids that immigrated from Asia. North America cannot be the site for the triplication of the 
aromatase genes in pig, therefore, and its climate 25-33 mya is irrelevant to an explanation for the 
triplication of the aromatase genes in pigs. 

Instead, modern pigs most likely emerged in Europe near the end of the Oligocene (Cooke & 
Wilkinson, 1978, but see also Pilgrim, 1941) from more primitive entelodonts such as Archaeotherium. 
During the Oligocene, the Dichobunids (the most probable ancestral stock) were most abundant in 
Europe. Likewise, the first true pig, Propalaeochoerus, from the late Oligocene, was common only in 
Europe (Cooke and Wilkinson, 1978; Carroll, 1988). This makes the paleoenvironment of Europe near 
the Oligocene-Miocene boundary relevant to the functional implications of the aromatase gene triplication 
in pigs. 

Various paleobiological evidence suggests that the climate in Europe also deteriorated in the 
Oligocene and warmed in the Miocene. A study of amphibian distribution in the Oligocene of Europe, for 
example, is consistent with a significant drop of mean annual temperatures in the European Oligocene. In 
the Miocene, amphibians populations rebounded, corresponding to an improvement in the climate (Rocek, 
1996). Likewise, analysis of the deer population suggested a subtropical climate returning to Europe in the 
early Miocene (Anzanza, 1993). The Iberian peninsula in the early Miocene had an intertropical to 
subtropical climate (Murelaga et al, 1999). Crocodiles also returned to Europe at the Oligocene-Miocene 
boundary (Antunes & Cahuzac, 1999). The presence of arboreal primates in the European Miocene also 
suggests a forested environment (Qi & Beard, 1998). Each of these facts (and many others) suggests that 
the second duplication of the aromatase gene in pigs occurred at the same time as the return of subtropical 
and warm temperate forests and woodlands to Europe, the type of environment for which suids are best 
adapted (Fortelius et al., 1996). 
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Immediately thereafter, the suids underwent a significant radiative divergence, and came to occupy all 
of the Old World. By the early Miocene, the two basal members that were to lead to all modern pigs, 
Hyotherium and Xenochoerus, were widespread in Europe, Asia, and Africa. The amelioration of the 
climate evidently assisted in this spread. For example, the pigs now in Africa apparently came from 
southwest Asia in the Early Miocene. A fossil of this date of a tetraconodontine pig has been reported 
from the Levant (van der Made & Tuna, 1999), through which the pigs would have migrated to get from 
Eurasia to Africa, and which was a tropical environment at the beginning of the Miocene (Tchernov, 
1992). In the middle and late Miocene, modern suids had diversified in Europe in further response to the 
change in the paleoclimate (Fortelius et al., 1996). 

Why might a change in climate with a return of forested (and perhaps tropical) ecosystems have led 
to a selection of pigs that had three different aromatase genes? We turned to porcine reproductive 
physiology for insight. We recently found that the type III aromatase was expressed by the embryo 
between day 1 1 and day 13 following fertilization, during the late pre-implantation period (Choi et al., 
1997a,b). The estrogen generated by the type HI isoform causes uterine undulation. This undulation, in 
turn, is expected to cause the spacing of the ca. 30 eggs that are fertilized in a typical conception, which 
eventually yield the 8-12 piglets that are normally birthed. In pigs, if the litter does not contain at least 5 
individuals, the entire conception is aborted. Thus, the embryonic form of aromatase may have a role in 
spacing the embryos uniformly around the uterus, and preventing abortion. These are useful adaptations if 
one wants to have an increased litter size. 

Evidence in the paleontological record suggests that the size of the litter in pigs increased 
dramatically 25-30 mya, at the same time as isoform HI of aromatase was generated by triplication, the 
local paleoclimate wanned, and the pigs began a major radiative divergence. The ancestral suid 
Archaeotheriutn, disappearing from the fossil record at the end of the Oligocene, may have given birth to a 
single pup. All of the contemporary forms of pigs arising from the divergence of Hyotherium and 
Xenochoerus, known from the Early Miocene, have large litter sizes. Further, Archaeomeryx, the early 
Eocene artiodactyl that is presumed to be the ancestral ruminant, resembles the contemporary chevrotain, 
which also births a single pup. 

The biogeography of the suids was again consulted to test the hypothesis that litter size increased in 
the suids near the time that the climate changed and the aromatase gene triplicated. As noted above, 
peccaries were isolated in the New World in the Early Oligocene, before the NED-derived date for the 
triplication of the aromatase gene in the Old World pigs. Consistent with the model, the peccary has only 
one offspring. The model predicts as well that the peccary should have only a single aromatase gene. 

Pig 

Type I C AAT CAT TAC ACG TGC CGA TTT GGC AGC AAA CTT GGG TTG GAA 
NHYTCRFGSKLGLE 
III T AGT CAC TAC ACA TCC CGA TTT GGC AGC AAA CCT GGG TTG CAG 
SHYTSRFGSKPGLQ 
II C AGT CAC TAC ACA TCC CGA TTC GGC AGC AAA CCT GGG TTG GAG 
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SHYTSRFGSKPGLE 
Peccary C AGT CAC TAC ACA TCC CGA TTC GGC AGC AAA CCT GGG TTG CAG 
SHYTSRFGSKPGLQ 



Pig 

Type I TGC ATT GGC ATG CAT GAA 
C I G M H E 
III TTC ATT GGC ATG CAT GAG 
F I G M H E 
II TGC ATC GGC ATG TAT GAG 
C I G M Y E 
Peccary TTC ATT GGA ATG CAT GAG 
F I G M H E 



AAA GGC ATC ATG TTT AAC AAT AA 

KG IMFNNN 
AAA GGC ATT ATA TTC AAC AAT AA 

KGIIFNNN 
AAG GGC ATC ATA TTT AAT AAT GA 

KGIIFNND 
AAA GGC ATC ATA TTT AAC AAC AA 

KGIIFNNN 



To test this prediction, peccary seminal plasma (from the Center for Reproduction of Endangered 
Species, Zoological Society of San Diego) was subjected to PCR amplification using exon 4-specific 
primers as described above. Bands having the expected sizes were observed by agarose gel 
electrophoresis. Five clones derived from the PCR products were found to have identical sequences, all 
different from the sequences of the pig aromatase. The NED comparison (using a rate constant of 3 x lO 
9 changes per base per year) suggested that the peccary diverged 40 mya from the pig, corresponding to 
the fossil record and the known isolation of the New and Old World paleoecosystems. 

The molecular biological, fossil, paleoecological, and physiological evidence are all consistent with a 
model that proposes that climate changes in Europe at the end of the Oligocene selected for pigs that had 
larger litter sizes. The successful lineage generated a new embryo aromatase by gene duplication, and 
expressed it at the time of implantation, forming the molecular basis of the physiology that enabled large 
litter sizes. It is possible to speculate on why a conversion from an open, savannah like environment to a 
forested environment might enable larger litter sizes. Contemporary savannah babies are large and born 
with the ability to run, presumably because hiding is no alternative. In contrast, in a forested environment, 
pups are easier to hide, permitting them to be smaller and less precocious at birth, permitting in turn a 
larger number of pups for the same total birth weight. Indeed, the contemporary Sus scrofa sow hides her 
piglets in earthen hollows covered with leaves (Eisenberg, 1981). 

Implantation is one of the least well understood steps in mammalian reproductive biology, including 
human reproductive biology. Implantation is, of course, found only in mammal reproductive physiology, 
and is itself therefore a relatively recent innovation in physiology, emerging perhaps 200 million years 
ago. This analysis emphasizes the degree of innovation and experimentation that is continuing in 
mammalian reproductive physiology. Further, the analysis is a combination of computational informatics, 
geology, paleontology, physiology, molecular biology and chemistry. Analogous analyses should be 
applicable in functional genomics throughout the biological, biomedical and biochemical sciences, 
especially as genome projects are completed and as new tools become available to analyze genomic 
databases. 
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Example 2. Covarion behavior 

Functional changes leave signatures in the patterns of sequence evolution in a protein family. 
Covarion behavior was detected in alcohol dehydrogenase [Ben89] and superoxide dismutase 
[Miy95], As a preliminary study in the past year, we examined elongation factors (EF). These are 
proteins that have diverged far more slowly; indeed, they are archetypal examples of a protein that 
performs the "same" function in all three kingdoms of life. 

In the study, thirty EF-Tu/EF-lct protein sequences were aligned over 380 sites using the 
alignment program DARWIN. Replacement rates per site for bacterial and eukaryotic EFs were 
estimated using a gamma-based, maximum likelihood (ML) model for protein sequences (JTT + r) 
and the phylogeny of Baldauf et al [Bal961 for EF-Tu and EF-la. An a of 0.78 was calculated for 
the entire tree, with a standard deviation (SD) of 0.05 using parametric bootstrapping (evolutionary 
simulations) [Swo96]. Interestingly, the a values for the bacterial and eukaryotic subtrees were 
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significantly different from that for the entire tree [0.46 (0.04) and 0.38 (0.04), respectively]. These 
reductions in a for bacteria and eukaryotes alone are expected of a non-stationary covarion process. 

The distribution of rate differences per site between bacterial and eukaryotic EFs is leptokurtotic; 
i.e., over- and under-represented in the mean and tails versus "shoulders " respectively, relative to 
the expectations of a normal distribution. Thirty seven percent of the sites have essentially the same 
rate in the two groups (rate difference of ~0), as expected under a stationary gamma process. 
However, 18 and 21 sites evolve >2 SD faster in bacteria than eukaryotes, and vice versa, respectively. 
These 10% of the sites are most responsible for the covarion characteristics of EF-Tu and EF-la. 

Residues displaying abnormal evolutionary behavior were then mapped to a three dimensional 
model of the protein based on a crystal structure of ET-Tu. These were used to generate structural 
hypotheses for the different behavioral differences that were known. For example, bacterial EF-Tu 
binds GDP -100 fold tighter than GTP. Eukaryotic EF-la, in contrast, binds both with similar 
affinities. EF-Tu regenerates its active form by binding to the single-subunit nucleotide exchange 
factor EF-Ts. EF-la requires the multi-subunit nucleotide exchange factor EF-lpy5. EF-la also 
interacts with the cytoskeleton and may thereby play a role in cellular transformation and apoptosis. 
EF-Tu can have no such role in bacteria. Residues were identified that, at the level of hypothesis, are 
responsible for each of these behavioral differences. 

Covarion behavior indicates changing function. It is therefore expected to correlate positively with 
events with high K a /K s ratios. We will see if that is correct. Because K a /K s ratios use a silent 
substitution clock that ticks rapidly, while covarion analysis does not, the two are somewhat 
complementary. This addresses another of the concerns of the referees, who objected that K a /K s 
ratios were not applicable far enough back in time for their tastes (only ca. 500 my). 

Example 3. Identifying mutations and in vitro properties of seminal ribonuclease that contribute to 
selected function. 

Bovine seminal ribonuclease (RNase) diverged from bovine pancreatic RNase approximately 35 
million years ago. Seminal RNase represents approximately 2% of the total protein in bovine seminal 
plasma. It displays antispermatogenic activity [Dostal, J., Matousek, J. (1973) Isolation and some 
chemical properties of aspermatogenic substance from bull seminal vesicle fluid. /. Reprod. Fertil 33, 
263-274], immunosuppressive activity [Soucek, J., Matousek, J. (1981) Inhibitory effect of bovine seminal 
ribonuclease on activated lymphocytes and lymphoblastoid cell lines in vitro. Folia Biol Praha 27, 334- 
345. Soucek, J., Hruba, A., Paluska, E., Chudomel, V., Dostal, J., Matousek, J. (1983) Immunosuppressive 
effects of bovine seminal fluid fractions with ribonuclease activity. Folia biologica (Praha) 29, 250-261. 
Soucek, J., Chudomel, V., Potmesilova, L, Novak, J. T. (1986) Effect of ribonucleases on cell, mediated 
lympholysis reaction and on GM, CFC colonies in bone marrow culture. Nat Immun. Cell Growth 
Regul 5, 250-258], and cytostatic activity against many transformed cell lines [Matousek, J. (1973) The 
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effect of bovine seminal ribonuclease on cells of Crocker tumor in mice. Experientia 29, 858. Vescia, S., 
Tramontano, D., Augusti-Tocco, G., D'Alessio, G. (1980) In vitro studies on selective inhibition of tumor 
cell growth by seminal ribonuclease. Cancer Res. 40, 3740 ] Each of these biological activities is 
essentially absent from pancreatic RNase. Further, seminal RNase binds to anionic glycolipids, binds and 
melts duplex DNA, hydrolyzes duplex RNA, has a dimeric quaternary structure, and binds to 
spermatozoa. 

Each of these behaviors is measured in vitro and is well known in the art. In the absence of the method 
of the instant invention, the behaviors are difficult to interpret. Some, any, or all of the behaviors might 
serve an adaptive role. It is possible that none of these behaviors serve adaptive roles. Indeed, it is 
conceivable that the protein has no adaptive role at all. This makes it difficult to make even the simplest 
research decisions, as the only in vitro properties of a protein that are interesting to study are those that 
have a physiological function. 

To resolve these issues, genes for seminal and pancreatic RNases were obtained from a variety of 
organisms closely related to Bos taurus, using cloning procedures well known in the art. These were then 
sequenced, and a maximum parsimony tree was constructed using MacClade. From this tree were 
calculated the sequences of RNases that were intermediates in the evolution of the seminal RNase, using 
the maximum parsimony method well known in the art. 

Next, the ratio of expressed to silent substitutions was calculated along each branch of the 
evolutionary tree. A very high ratio of expressed to silent substitutions was observed in the evolutionary 
period following the divergence of kudu [Trabesinger-Riif, N., Jermann, T. M., Zankel, T. R., Durrant, B., 
Frank, G., Benner. S. A. Pseudogenes in ribonuclease evolution. A source of new biomacromolecular 
function? FEBS Lett 382, 319-322 (1996)] from the lineage leading to ox, until the divergence of water 
buffalo and ox. This is indicative of an episode of adaptive evolution, where the protein acquires a new 
physiological function. Further work indicated that the seminal RNase gene was not expressed in the 
period of evolution since the divergence of the seminal RNase family and the divergence of kudu. 

Last, protein engineering methods were used to prepare the seminal RNase that was at the beginning 
of the episode of rapid sequence evolution. It properties were then examined experimentally. It was 
discovered that the ability of the protein to bind to anionic glycolipids was roughly the same before and 
after this episode of rapid evolution. So too was its sensitivity to inhibition by placental RNase inhibitor. 
Thus, both of these properties are not likely to be under selective pressure. 

In contrast, the immunosuppressivity of the ancestral RNase (IC50 ca. 8 micrograms/mL) was greater 
than that of pancreatic RNase (IC50 ca. 100 micrograms/mL). But following the period of rapid sequence 
evolution characteristic of a protein evolving to serve a new physiological function, the 
immunosuppressivity became still greater (IC50 ca. 2 micrograms/mL). Thus, one concludes that 
immunosuppressivity as measured in vitro is a selected trait of the protein, or is closely structurally 
coupled to a trait that is selected. 
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Likewise, the ability of the seminal RNase protein to bind and melt duplex DNA, and to hydrolyze 
duplex RNA, also underwent rapid increase between the time of divergence of kudu from modern ox. 
Thus, it too is either a selected trait of the protein, or is closely structurally coupled to a trait that is 
selected. 

In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for example) 
that are removed from their native environment, often in pure or purified states. While isolation and 
purification of molecules and molecular aggregates from biological systems is an essential part of 
contemporary biological research, the fact that the data are obtained in a non-native environment raises 
questions concerning their physiological relevance. Properties of biological systems determined in vitro 
need not correspond to those in vivo, and properties determined in vitro need have no biological relevance 
in vivo. 

To date, there has been no simple way to say whether or not biological behaviors are important 
physiologically to a host organism. Even in those cases where a relatively strong case can be made for 
physiological relevance (for example, for enzymes that catalyze steps in primary metabolism), it has 
proven to be difficult to decide whether individual properties of that enzymes (kcat, K m , kinetic order, 
stereospecificity, etc.) have physiological relevance. Especially difficult, however, is to ascertain which 
behaviors measures in vitro play roles in "higher" function in metazoa, including digestion, development, 
regulation, reproduction, and complex behavior. 

Analysis of non-Markovian behavior, as described above, permits the biological chemist to identify 
episodes in the history of a protein family where new function is emerging. This suggests a general 
method to determine whether a behavior measured in vitro is important to the evolution of new 
physiological function. We may take the following steps: 

(a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the 
ancestral proteins before, during, and after the evolution of new biological function (34), as revealed by an 
episode of high expressed to silent ratio of substitution in a protein. This high ratio compels the 
conclusion that the protein itself serves a physiological role, one that is changing during the period of 
rapid non-Markovian sequence evolution. 

(b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and after the 
evolution of new biological function, as revealed by an episode of high expressed to silent ratio of 
substitution. Those behaviors that increase during this episode are deduced to be important for 
physiological function. Those that do not are not. 

An example of this method was applied to the bovine seminal ribonuclease (RNase) family. Bovine 
seminal RNase diverged from bovine pancreatic RNase approximately 35 million years ago. Seminal 
RNase represents approximately 2% of the total protein in bovine seminal plasma. It displays 
antispermatogenic activity [J. Dostal and J. Matousek, Isolation and some chemical properties of 
aspermatogenic substance from bull seminal vesicle fluid. /. Reprod. Fertil 33, 263-274 (1973),], 
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immunosuppressive activity [J. Soucek, Matousek, I, Inhibitory effect of bovine seminal ribonuclease on 
activated lymphocytes and lymphoblastoid cell lines in vitro. Folia Biol Praha 27, 334-345 (1981)., J. 
Soucek, A. Hruba, E. Paluska, V. Chudomel, J. DostaL and J. Matousek, Immunosuppressive effects of 
bovine seminal fluid fractions with ribonuclease activity. Folia biologica (Praha) 29, 250-261 (1983)., J. 
Soucek, V. Chudomel, L Potmesilova, and J. T. Novak, Effect of ribonucleases on cell, mediated 
lympholysis reaction and on GM, CFC colonies in bone marrow culture. Nat. Immun. Cell Growth 
Regal 5, 250-258 (1986)], and cytostatic activity against many transformed cell lines [J. Matousek, The 
effect of bovine seminal ribonuclease on cells of Crocker tumor in mice. Experientia 29, 858-859 (1973), 
S. Vescia, D. Tramontano, G. Augusti-Tocco and G. D'alessio, In vitro studies on selective inhibition of 
tumor cell growth by seminal ribonuclease. Cancer Res. 40, 3740-3744 (1980)] Each of these biological 
activities is essentially absent from pancreatic RNase. Further, seminal RNase binds to anionic 
glycolipids, binds and melts duplex DNA, hydrolyzes duplex RNA, has a dimeric quaternary structure, 
and binds to spermatozoa. 

Each of these behaviors is measured in vitro, as is the case for a wide range of biological 
phenomenology recorded in the literature. The behaviors are difficult to interpret. Some, any, or all of the 
behaviors might serve an adaptive role. It is possible that none of these behaviors serve adaptive roles. 
Indeed, it is conceivable that the protein has no adaptive role at all. This makes it difficult to make even the 
simplest research decisions, as the only in vitro properties of a protein that are interesting to study are 
those that have a physiological function. 

To resolve these issues using the post-genomic method outlined above, genes for seminal and 
pancreatic RNases were obtained from a variety of organisms closely related to Bos taurus, using cloning 
procedures well known in the art. These were then sequenced, and a maximum parsimony tree was 
constructed using MacClade. From this tree were calculated the sequences of RNases that were 
intermediates in the evolution of the seminal RNase, using the maximum parsimony method and checked 
using maximum likelihood tools implemented in Darwin (23). 

Next, the ratio of expressed to silent substitutions was calculated along each branch of the evolutionary 
tree. A very high ratio of expressed to silent substitutions was observed in the evolutionary period 
following the divergence of cape buffalo [N. Trabesinger-RuF, T. M. Jermann, T. R. Zankel, B. Durrant, 
G. Frank and S. A. Benner, Pseudogenes in ribonuclease evolution. A source of new biomacromolecular 
function? FEBS Lett 382, 319-322 (1996).] from the lineage leading to ox, until the divergence of water 
buffalo and ox. This is indicative of an episode of adaptive evolution, where the protein acquires a new 
physiological function. Further work indicated that the seminal RNase gene was not expressed in the 
period of evolution since the divergence of the seminal RNase family and the divergence of cape buffalo. 

Last, protein engineering methods were used to prepare the seminal RNase that existed at the 
beginning of the episode of rapid sequence evolution. Its properties were then examined experimentally. It 
was discovered that the ability of the protein to bind to anionic glycolipids was roughly the same before 
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and after this episode of rapid evolution. So too was its sensitivity to inhibition by placental RNase 
inhibitor. Thus, both of these properties are not likely to be under selective pressure. 

In contrast, the immunosuppressivity of the ancestral RNase (IC50 ca. 8 micrograms/mL) was greater 
than that of pancreatic RNase (IC50 ca. 100 micrograms/mL) (J. Sleasman, M. Rojas, personal 
communication). But following the period of rapid sequence evolution characteristic of a protein evolving 
to serve a new physiological function, the immunosuppressivity became still greater (IC50 ca. 2 
micrograms/mL). Thus, one concludes that immunosuppressivity as measured in vitro is a selected trait of 
the protein, or is closely structurally coupled to a trait that is selected. 

Likewise, the ability of the seminal RNase protein to bind and melt duplex DN A, and to hydrolyze 
duplex RN A, also underwent rapid increases between the time of divergence of cape buffalo from modern 
ox. Thus, it too is either a selected trait of the protein, or is closely structurally coupled to a trait that is 
selected. In contrast, dimeric structure did not emerge during this period. Dimeric structure, therefore, is 
presumably not as important to the new selected function of the protein, although it may be a trait that was 
initially useful in the selection of the system for further optimization during the period of rapid evolution. 

Example 4. Assignment of episodes of adaptive evolution in the protein leptin, and placing these in 
predicted secondary structural elements 

From the GenBank database, DNA and protein sequences were retrieved for the genes encoding 
leptins and the corresponding proteins, also known as the obesity gene product. A multiple alignment for 
the protein sequences was constructed for the DNA sequences and the protein sequences. These were 
converted to a file suitable for MacClade to use. For both the DNA and protein sequences, a tree using 
MacClade was built based on the known relationship between the organisms from which these sequences 
were derived; this proved to be the most parsimonious tree as well MacClade was also used to built a tree 
for the protein sequences based on the known relationship between organisms; this proved not to be the 
most parsimonious tree (by 1 change). The DNA tree was taken to be definitive because of its consistency 
with the biological (cladistic) data showing that the primates form a clade. 

A secondary structure prediction was made for the protein family using the tools disclosed in Serial 
No. 07/857,224. The evolutionary divergence of the sequences available for the leptin family is small; only 
21 PAM units (point accepted mutations per 100 amino acids), predictions were biased to favor surface 
assignments [Benner, S. A., Badcoe, L, Cohen, M. A., Gerloff, D. L. Bona fide prediction of aspects of 
protein conformation. Assigning interior and surface residues from patterns of variation and conservation 
in homologous protein sequences. 1 Mol Biol 235, 926-958 (1994)]. Thus, positions holding conserved 
KREND were assigned as surface residues, conserved H and Q were assigned to the surface as well, 
while positions holding conserved CST were assigned as uncertain, suface and interior assignments are 
summarized in Table 3. 
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A secondary structure was then predicted for the leptins using the methods disclosed in Serial No. 
07/857,224. The multiple alignment is shown in Table 3. Five separate secondary structural elements were 
identified results are summarized in Table 3. A disulfide bond is presumed to connect positions 96 and 
146. These secondary structural elements can be accommodated by only a small number of overall folds. 
Interestingly, the pattern of secondary structure in this prediction is consistent with an overall fold that 
resembles that seen in cytokines such as colony stimulating factor [Hill, C, P., Osslund, T. D., Eisenberg, 
D. (1993) Proc. Nat Acad. Sci. 90, 5176-5181] and human growth hormone [de Vos, A. M., Ultsch, M. 
& Kossiakoff, A. A. (1992). Science 255, 306-312]. 

To decide whether evolutionary function may have changed under selective pressure during the 
divergent evolution of the protein family, a multiple alignment of the protein sequences and a multiple 
alignment for the corresponding DNA sequences were constructed. A MacClade-generated maximum 
parsimony tree was printed for each position in the protein sequence where there was a change, and for 
each position in the DNA sequence where there was a change. Each mutation on each tree was examined 
by hand, and silent and expressed mutations occurred were assigned to individual branches on the 
evolutionary tree. For each branch of the tree, the sum of the number of silent and expressed changes were 
tabulated, and the ratio of expressed to silent changes calculated. These are shown in Drawing 1 . Tables 4 
and 5 contain the data used in this example. 

The branches on the evolutionary tree leading to the primate leptins from their ancestors at the time 
that rodents and primates diverged had an extremely high ratio of expressed to silent changes. From this 
analysis, it was concluded that the biological function of leptins has changed significantly in the primates 
rlative to the function of the leptin in the common ancestor of primates and rodents. 

This approach can be illustrated in a biomedically interesting family of proteins by examining the 
protein leptin, a protein whose mutation in mice is evidently correlated with obesity, and was previously 
known as the "obesity gene protein". The protein has attracted substantial interest in the pharmaceutical 
industry, especially after a human gene encoding a leptin homolog was isolated. According to the 
conventional evolutionary paradigm, because it is a homolog of the mouse leptin, the human leptin must 
also play a role in obesity, and might be an appropriate target for pharmaceutical companies seeking 
human pharmaceuticals to combat this common condition in the first world. 

DNA and protein sequences were retrieved for the genes encoding leptins. A multiple alignment for the 
protein sequences was constructed for the DNA sequences and the protein sequences. Congruent tress for 
both the DNA and protein sequences were then constructed, and sequences at the nodes of the tree 
reconstructed using MacClade [W. P. Maddison, D. R. Maddison, MacClade. Analysis ofPhylogeny and 
Character Evolution. Sinauer Associates, Sunderland MA (1992).] and the known relationship between 
the organisms from which these sequences were derived. For the DNA sequences, the biologically most 
plausible tree proved to be the most parsimonious tree as well. The most parsimonious tree for the protein 
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sequences proved not to be the most plausible tree (by one change) from a biological perspective. The 
DNA tree was taken to be definitive because of its consistency with the biological (cladistic) data. 

A secondary structure prediction was made for the protein family. The evolutionary divergence of the 
sequences available for the leptin family is small - only 21 PAM units (point accepted mutations per 100 
amino acids) - and predictions were biased to favor surface assignments [S. A. Benner, L Badcoe, M. A. 
Cohen and D. L. Gerloff, Bona fide prediction of aspects of protein conformation. Assigning interior and 
surface residues from patterns of variation and conservation in homologous protein sequences. 7. Mol 
Biol 235, 926-958 (1994).]. Thus, positions holding conserved KREND were assigned as surface 
residues, conserved H and Q were assigned to the surface as well, while positions holding conserved CST 
were assigned as uncertain. 

Five separate secondary structural elements were identified. A disulfide bond was presumed to connect 
positions 96 and 146. These secondary structural elements can be accommodated by only a small number 
of overall folds. Interestingly, the pattern of secondary structure in this prediction is consistent with an 
overall fold that resembles that seen in cytokines such as colony stimulating factor [C. P. Hill, T. D. 
Osslund and D. Eisenberg, The structure of granulocyte colony stimulating factor and its relationship to 
other growth factors. Proa Nat Acad ScL 90, 5176-5181 (1993).] and human growth hormone [A. M. 
De Vos, M. Ultsch and A. A. Kossiakoff, Human growth-hormone and extracellular domain of its 
receptor. Crystal-structure of the complex.Science 255, 306-312 (1992).], 

To decide whether evolutionary function may have changed under selective pressure during the 
divergent evolution of the protein family, silent and expressed mutations were assigned to individual 
branches on the evolutionary tree. For each branch of the tree, the sum of the number of silent and 
expressed changes were tabulated, and the ratio of expressed to silent changes calculated. These are 
shown in Drawing 2. 

The branches on the evolutionary tree leading to the primate leptins from their ancestors at the time that 
rodents and primates diverged had an extremely high ratio of expressed to silent changes. From this 
analysis, it was concluded that the biological function of leptins has changed significantly in the primates 
relative to the function of the leptin in the common ancestor of primates and rodents. This conclusion has 
several implications of importance, not the least being for pharmaceutical companies asked whether they 
should explore leptins as a pharmaceutical target. At the very least, it suggests that the mouse is not a good 
pharmacological model for compounds to be tested for their ability to combat obesity in humans. The 
post-genomic analysis suggests that a primate model must be used to test those compounds, with 
implications for the cost of developing an anti-obesity drug based on the leptin protein. 

Intriguingly, a tree can also be built for the leptin receptor. Here, the evolutionary history is not so 
complete. In particular, fewer primate sequences are available for the leptin receptor than for leptin itself. 
Thus, the reconstructed ancestral sequences are less precise with the leptin receptor family, and the 
assignment of expressed and silent mutations to the tree are less certain. Nevertheless, it appears that the 
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leptin receptor has undergone an episode of rapid sequence evolution in the primate half of the family as 
well. The example illustrates how much sequence data is needed (much) to build reliable models of this 
nature, as the ambiguity in the assignment of ancestral sequences makes it possible that the receptor was 
evolving rapidly not only in the lineage leading to primates but also in the lineage leading to mouse. 

Nevertheless, the approximate correlation between the episode of rapid sequence evolution in the leptin 
family and in the leptin receptor family suggests a tool that might become useful in the advanced stages of 
post-genomic science when evolutionary histories are very well articulated. Here, it might be possible to 
detect ligand-receptor relationships between protein families in the database by a correspondence between 
their episodes of rapid sequence evolution. Thus, ligand families should evolve rapidly (in a non- 
Markovian fashion) at the same time in geological history as their receptors evolve. It will be interesting to 
identify more sequences for primate leptin receptors to see if a more complete evolutionary history allows 
us to see more clearly the co-evolution of the leptin receptor and leptin itself. 

Example 5 Alcohol dehydrogenase 

Mammalian alcohol dehydrogenase (E.C.I. 1.1.1) have undergone a rapid episode of sequence 
evolution in and around the active site as substrate specificity has divergently evolved to handle xenobiotic 
substances in the liver. In contrast, over a comparable span of evolutionary distance, the active site of yeast 
alcohol dehydrogenase has changed very little, corresponding to an apparently constant role of the enzyme 
to act on the ethanol-acetaldehyde redox couple. Indeed, by identifying positions in mammalian 
dehydrogenases where amino acid variation was observed over a span of evolution where the same 
residues were conserved in the yeast dehydrogenases provided a clear map of the active site of the protein. 

Example 6 Notch protein 

A set of Notch homologs were obtained, and used to buid a multiple sequence alignment, and 
evolutionary tree (Drawing 6) and reconsructed intermediates throughout the evolutionary tree. 

The functional interpretation based on these tools proceeded as follows. First, the f2 values showed 
that the silent substitutions were not equilibrated over much of the tree. However, the f2 value becomes 
close to 0.5 at points where the phyla diverge, suggesting near equilibration in the silent values. This 
defines the root of the tree near node 13. K a /K s values are given on the branches (numbers in italics). 
They suggest at the level of hypothesis that notch 1, notch 3 and notch are proteins with derived functions, 
while notch 4 is the paralog in mammals with the ancestral function. The rate constant for silent 
substitution is calculated to be ca. 23 x 10* -9 changes/base per hear. This suggests that the notch paralogs 
diverged ca. 400 MYA. This is at the time of the development of advanced organis in vertebrates, 
suggesting that the Notch paralogs with derived function in the vertebrates are important for this level of 
organogenesis. 
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Example 7. C. elegans paralogs 

NED distances are especially useful when comparing paralogs. Here, we need not worry so much 
about codon bias (it has at least been uniform among paralogs at any instant in evolutionary history). For 
example, we used the Master Catalog to identify all families of paralogs in the genome of C. elegans. Ca. 
1250 families of paralogs with four or more members is found. We separated the families into in various 
classes using NED dates. 

(a) Families where duplications all occurred > 400 MYA 

(b) Families where duplications all occurred < 100 MYA 

(c) Families where duplications have been ongoing throughout the past 400 MY. 

(d) Families with duplications in specific episodes. 

(e) Families showing a history of duplication > 400 MYA, but also having more recent episodes of 
recruitment. 



Table 2 presents data from just five of these 1250 families. 



Number of nodes 


generating 


paralogs 


in indicated 


time 


>400 


MYA 0 


-100 


100-200 


200-300 


300-400 


gprod_19987 


3 9 


1 


4 


0 


5 


Mariner transposase 












gprod_31705 


6 


0 


0 


0 


0 


similar to reverse transcriptase 










gprod_32709 


11 


3 


0 


0 


1 


Histone H2A 










2 


gprod_789 4 


5 


2 


0 


0 


No definition line 










3 9 


gprod_19811 


5 


2 


3 


5 


Serine-threonine kinase. 













If the reviewer is a biomedical scientist, the Table immediately suggests ideas. Consider the family 
annotated as a serine-threonine kinase. It has 145 members in the Master Catalog; 55 or these are from 
elegans. The kinases generated by the recent duplications cannot part of the basic developmental plan of 
elegans; this was established 500 MYA. This raises questions: What is it about the serine-threonine 
kinases that recently diverged that might have something to do with recently evolved physiology? We then 
examine the K a /K s value within the Master Catalog trees, all with a click of a mouse button. We 
hypothesize which descendants of recent duplications performing the derived function, and which perform 
the primitive function. Dating the divergence, we try to make statements about changes in nematode 
biology that might be associated with the duplication. These hypotheses can now be tested by experiment 
(knock-outs, in particular). 



description 





A 


B 


c 


D 


E F 




family 


0- 


0.5- 


1.0- 


1.5- 


2.0-sum 


average 




0.5 


1.0 


1.5 


2.0 


2.5 


#char 


gprod_1025 


0 


0 


0 


0 


5 5 


143.4 


gprod_1063 


0 


0 


1 


0 


2 3 


46 


gprod_1069 


0 


0 


1 


0 


2 3 


3 


gprod_10729 


1 


0 


0 


0 


3 4 


143.5 


gprod_10751 


0 


0 


1 


0 


5 6 


204.667 


gprod_1090 


0 


0 


0 


0 


3 3 


35.3333 
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gprod_1110 


0 


0 


1 


2 


1 


4 


gprod_1129 


0 


0 


0 


0 


3 


3 


gprod_11679 


0 


0 


0 


0 


3 


3 


gprod_1240 


0 


0 


0 


0 


4 


4 


gprod_12669 


1 


0 


0 


0 


2 


3 


gprod_1273 


1 


0 


0 


0 


3 


4 


gprod_12815 


0 


0 


0 


0 


3 


3 


gprod_1318 


0 


0 


1 


1 


1 


3 


gprod„13259 


0 


0 


1 


1 


6 


8 


gprod_1354 


0 


0 


0 


0 


3 


3 


gprod_13591 


0 


0 


0 


0 


5 


5 


gprod_1405 


0 


0 


0 


1 


1 


2 


gprod_14189 


0 


0 


2 


0 


14 


16 


gprod_1468 


0 


0 


0 


0 


2 


2 


gprod_1471 


0 


0 


0 


0 


2 


2 


gprod_15094 


0 


0 


0 


0 


4 


4 


gprod_15198 


0 


0 


1 


0 


2 


3 


gprod„15375 


0 


0 


0 


0 


5 


5 


gprod_15390 


1 


0 


0 


3 


13 


17 


gprod_15452 


0 


0 


0 


1 


11 


12 


gprod_15464 


2 


0 


0 


0 


5 


7 


gprod_15559 


0 


0 


0 


0 


2 


2 


gprod_15565 


0 


0 


1 


0 


5 


6 


gprod_15577 


0 


0 


0 


0 


7 


7 


gprod_15586 


1 


0 


0 


0 


2 


3 


gprod_15588 


0 


0 


1 


0 


4 


5 


gprod„15724 


1 


0 


0 


0 


2 


3 


gprod_15801 


0 


0 


0 


1 


2 


3 


gprod_15805 


0 


0 


0 


2 


3 


5 


gprod_15819 


0 


0 


2 


0 


5 


7 


gprod_15877 


0 


0 


1 


1 


1 


3 


gprod_15878 


2 


0 


0 


1 


0 


3 


gprod_15899 


0 


0 


0 


0 


3 


3 


gprod_15929 


0 


0 


3 


2 


5 


10 


gprod_15937 


0 


0 


0 


1 


2 


3 


gprod_15971 


0 


0 


1 


0 


0 


1 


gprod_15974 


1 


0 


1 


0 


2 


4 


gprod_16058 


2 


0 


0 


0 


0 


2 


gprod_16059 


0 


0 


0 


1 


1 


2 


gprod_16306 


4 


0 


0 


0 


0 


4 


gprod_16402 


1 


0 


0 


0 


4 


5 


gprod_16477 


0 


0 


0 


0 


0 


0 


gprod_16653 


0 


0 


0 


1 


2 


3 


gprod_16715 


0 


0 


0 


0 


3 


3 


gprod_1689 


0 


0 


1 


0 


3 


4 


gprod_16897 


0 


0 


0 


0 


4 


4 


gprod_16898 


0 


0 


1 


1 


3 


5 


gpnxL17379 


2 


0 


1 


0 


0 


3 


gprod_1740 


0 


0 


0 


1 


3 


4 


gprod_17677 


0 


0 


0 


0 


4 


4 


gprod_1825 


0 


0 


0 


0 


4 


4 


gprod_19415 


0 


0 


0 


0 


4 


4 


gprod_19478 


0 


0 


0 


1 


3 


4 


gprod_19775 


0 


0 


0 


0 


1 


1 


gprod_19789 


0 


0 


0 


1 


3 


4 


gprod„19814 


2 


0 


1 


0 


3 


6 


gprod_19828 


0 


0 


1 


0 


8 


9 


gprod_19849 


0 


0 


0 


0 


3 


3 


gprod_19867 


2 


0 


1 


1 


12 


16 



55 



151.75 

89.6667 

48.3333 
146 

35 
153 

36 

59.6667 

55.875 
152 
116.8 
222 

601.625 similar to tubulin alpha-2 chain 
18 
95 

56 adenylate kinase 

4 1 .6667 4-nitrophenylphosphatases 

42.8 Ce mostly, similar to triacylgycerol lipases 

916.529 guanylate cyclases, cat. domain protein kinases 
616.75 weak similarity to nodulation protein X 
242.429 serine protease inhibitor 
272 clathrin coat assembly like protein 

68 

37 

35.3333 
135.2 
60.3333 

157. 667 elongation factor EF 
330.4 

169 putative integral membrane transport protein 

11.6667 

1032 glyceraldehyde 3-phosphate dehydrogenase 

61 

108.5 
31 
216 

29.5 
651.5 
121 
1220 
81.4 
0 

73.6667 
81 

102.25 

80.25 

40.2 
502 

91.75 

72.5 
162.5 
342 

34.25 
460 
205.5 
183.833 
115.222 

80.3333 
227.625 
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gprod_19899 


0 


0 


gprod_19925 


0 


0 


gprod_19926 


0 


0 


gprod_19931 


0 


0 


gprod_19938 


1 


0 


gprod„19967 


0 


0 


gprod_19971 


4 


0 


gprod_19979 


0 


0 


gprod_19983 


0 


0 


gprod_20015 


4 


0 


gprod_20024 


0 


0 


gprod_20031 


1 


0 


gprod_20077 


2 


0 


gprod_20083 


3 


0 


gprod_20110 


0 


0 


gprod_20113 


0 


0 


gprod_20115 


0 


0 


gprod_20122 


0 


0 


gprod_20124 


1 


0 


gprod_20125 


0 


0 


gprod_20126 


0 


0 


gprod_20154 


0 


0 


gprod_20156 


0 


0 


gprod_20169 


0 


0 


gprod_20173 


1 


0 


gprod_20188 


0 


0 


gprod_20196 


0 


0 


gprod„20238 


0 


0 


gprod_20244 


0 


0 


gprod„20245 


0 


0 


gprod_20270 


0 


0 


gprod_20295 


1 


0 


gprod_20307 


1 


0 


gprod_20367 


0 


0 


gprod_20414 


0 


0 


gprod_20418 


0 


0 


gprod_20535 


0 


0 


gprod_20557 


0 


0 


gprod_20558 


0 


0 


gprod_20569 


1 


0 


gprod_20570 


2 


0 


gprod_20576 


0 


0 


gprod_20591 


3 


0 


gprod_20624 


0 


0 


gprod_20628 


0 


0 


gprod_20641 


0 


0 


gprod_20662 


0 


0 


gprod_20664 


0 


0 


gprod_20680 


0 


0 


gprod_20700 


2 


0 


gprod_20727 


4 


0 


gprod_20734 


0 


0 


gprod_20741 


0 


0 


gprod_20765 


0 


0 


gprod_20828 


1 


0 


gprod_20829 


0 


0 


gprod_20837 


1 


0 


gprod_20852 


0 


0 


gprod_20893 


0 


0 



1 


2 


4 


7 


100.143 


0 


0 


3 


3 


20.3333 


1 


1 


7 


9 


121.556 


0 


0 


0 


0 


0 


2 


0 


0 


3 


1151.33 


0 


0 


1 


1 


276 


0 


0 


3 


7 


418.571 


0 


0 


1 


1 


200 


0 


0 


7 


7 


116.429 


0 


1 


7 


12 


989.333 


1 


0 


2 


3 


61.3333 


2 


0 


2 


5 


39.6 


0 


0 


1 


3 


1050.67 


0 


0 


0 


3 


438 


0 


0 


1 


1 


212 


1 


2 


4 


7 


227.286 


0 


0 


6 


6 


213.833 


0 


0 


6 


6 


108.5 


3 


0 


9 


13 


95.3846 


1 


0 


2 


3 


79.3333 


4 


0 


7 


11 


62.8182 


0 


1 


5 


6 


138 


1 


2 


3 


6 


91 


4 


4 


11 


19 


939.842 


2 


0 


0 


3 


491.667 


1 


1 


7 


9 


95.8889 


1 


0 


7 


8 


197.5 


0 


0 


5 


5 


118.8 


0 


0 


3 


3 


82 


1 


0 


2 


3 


25.3333 


0 


0 


2 


2 


77 


1 


0 


7 


9 


274.111 


1 


0 


1 


3 


22 


0 


0 


2 


2 


97 


0 


2 


1 


3 


125.333 


1 


0 


2 


3 


154.333 


0 


0 


2 


2 


399 


0 


0 


1 


1 


140 


0 


0 


4 


4 
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One observation apparent from the Table is that genes that have multiple recent recruitments in C. 
elegans are unlikely to have clearly identifiable homologs in other phyla, while those that have few recent 
recruitments are more likely than average to have clearly identifiable homologs in other phyla. 
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CLAIMS 

What is claimed is: 

1 . A method for constructing hypotheses concerning the function of protein families based on an 
analysis of the non-stochastic behavior of sequence evolution. 
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Abstract of the Disclosure 



This disclosure describes the classes of tools that permit a scientist to generate experimentally testable 
hypotheses concerning the function of a protein starting from an evolutionary analysis. These are outlined 
below: 

L Tools that detect change in function within a family of proteins. 

A. Ratios of silent to non-silent substitution along specific branches of an evolutionary tree 
including tools that address normalization issues. 

B. Covarion behavior, in which individual residues display different mutability in different 
branches of a tree. 

C. Detecting high absolute rates of amino acid substitution, changes per unit time. 
II. Tools that detect conservation of function within a family of proteins. 

A. Compensatory changes 

B. Homoplasy 

C Absolute conservation within a defined evolutionary distance 
EI. Tools that identify individual residues involved in changes in functionally significant behavior. 

A. Residues changing in episodes with high Ka/Ks values, minus residues changing in episodes 
with low Ka/Ks values 

B. Residues displaying covarion behavior 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

IV. Tools that identify individual residues involved in conserved of functionally significant behavior 

A. Residues suffering compensatory changes 

B. Residues displaying homoplasy 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

V. Tools that involve correlation between the evolutionary histories of two families of proteins 

A. Correlating the topology of evolutionary trees in two families of proteins 

B. Correlating the connectivity of proteins in a gene family 

C. Dating events in the molecular history 

D. Correlating evolutionary events in two protein families occuring at approximately the same 
time 

R Correlating evolutionary events in two protein families that are associated with analogous 
behavior involving expressed/silent ratios 

VI. Tools that involve correlation between the evolutionary history of a family of proteins and the 
evolutionary history of the organism as known from some source other than genomic sequence data, 
including paleontology, geology, ecology, ontogeny, phylogeny, or systematics (collectively known as 
the "non-genomic record". 

A. Correlating the topology of an evolutionary trees and the non-genomic record. 

B. Correlating features of patterns of evolution in specific branches in the evolutionary tree with 
the non-genomic record 

C. Correlating evolutionary events in several protein families occuring at approximately the same 
time with the non-genomic record 
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Drawings 

Drawing 1 
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Drawing 2 
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Drawing 3 
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Drawing 4 




Steven A. Benner 

12 Mus mus cuius (mouse) 

13 Rattus norvegicus (rat) 

7 Sus scrofa (pig) placenta, isoform II 

8 Sus scrofa (pig) embryo, isoform III 

9 Sus scrofa (pig) ovary, isoform I 

10 Bos taurus (ox) 

1 1 Equus caballus (horse) 

15 Homo sapiens (human) 

14 Oryctolagus cuniculus (rabbit) 
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Drawing 5 
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Declaration for Patent Application 



As a below named inventor, I declare that: 

My residence, post office address, and citizenship are as stated next to my name. 

I believe that I am the original, first, and sole inventor of the subject matter that is 
claimed and for which a patent is sought on the invention entitled: "Evolution-based 
Functional Genomics" , the specification of which is attached hereto. I also declare that I 
am a "small entity" under the US Code. 

I hereby state that I have reviewed and understand the contents of the above identified 
specification, including the claims. 

I acknowledge my duty to disclose information that is material to the examination of 
this application in accordance with Title 37, Code of Federal Regulations, Section 1.56 
(a). 

I hereby claim the benefit under Title 35, United States Code, §120 of any United 
States application(s) listed below and, insofar as the subject matter of each of the 
claims of this application is not disclosed in the prior United States application in the 
manner provided by the first paragraph of Title 35, United States Code, §112, 1 
acknowledge the duty to disclose all information I know to be material to 
patentability as defined in Title 37, Code of Federal Regulations, § 1.56(a) which 
became available between the filing date of the prior application and the national or 
PCT international filing date of this application: 

U.S. SERIAL NO. FILING DATE STATUS 

07/857,224 03/25/92 issued US 5,958,784 

08/914,375 08/08/97 Pending 

Address all telephone calls to Steven A. Benner at telephone number (352) 332-2262. 

Address all correspondence to Steven A. Benner, 1501 N.W. 68th Terrace, 
Gainesville, FL 32605-4147. 

I hereby declare that all statements made herein of my own knowledge are true and that 
all statements made on information and belief are believed to be true; and further that 
these statements were made with the knowledge that willful false statements and the like 
so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 
of the United States Code and that such willful false statements may jeopardize the 
validity of the application or any patent issued thereon. 

Full name of sole inventor: Steven iUbert Benner 

Inventor's signature:^ ^ 

Residence/Postal addressflSOl N.W. 68th Terrace, Gainesville FL 32605 
Citizenship: United States of America 

Before me this 31st day of December, 1996 personally appeared Steven Albert Benner, 
who acknowledged the foregoing instrument to be his free act and deed. 
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